Glossary of Special Terms and Symbols

A = number of alternatives in a multiple-choice question
a-value = discrimination index
ASI = Alternative Similarity Index
b-value = difficulty index
BME = Bayesian Modal Estimation
c-value = pseudo-guessing index
CRT = Cathode Ray Tube device
d-value = point-biserial correlation
d.f. = distribution function, an ogive
E = Error score
e = base of natural logarithms
exp( ) = e raised to the power of whatever is in the parentheses after the exp
f.f. = frequency function, bell-shaped curve
I(θ) = Test Information Curve
I(θ, u) = Item Information Function
ICC = Item Characteristic Curve, same as IRF
IIF = Item Information Function, I(θ, u)
IRF = Item Response Function
IRT = Item Response Theory
KR-20 = Kuder-Richardson Formula 20
L = Likelihood
L(0,1.7) = Logistic Frequency Function
L(θ|U) = Likelihood of θ, given U
L(U|θ) = Likelihood of U, given θ
m = slope of the ogive at the b-value
MAPL = Minimum Acceptable Performance Level
MLE = Maximum Likelihood Estimation
N(0,1) = Normal f.f.
p-value = proportion of examinees selecting an item alternative
P_i = P_i(θ) = Probability of getting item correct, given θ
Q_i = Q_i(θ) = Probability of getting item wrong, given θ
r_gθ = item biserial correlation
r_gh = interitem tetrachoric correlation
r_XX = reliability of classical test theory
REC = Relative Efficiency Curve, ratio of TICs
SD = standard deviation
SEE = Standard Error of Estimate
SIC = Score Information Curve
SME = Subject Matter Expert
T = True score, Observed score - Error
TIC = Test Information Curve, I(θ), ΣI(θ, u)
USCSC = U.S. Civil Service Commission
U = response vector, response pattern
u = response, u_i = 1 if response is correct and u_i = 0 if response is wrong
W(θ) = optimal weight of an item
X = Observed score
X̄ = Mean
θ = Theta, the ability scale
∫ = Integral sign
Ψ = Psi, logistic ogive
Φ = Phi, normal ogive
Σ = Summation of a series of numbers
Π = Product of a series of numbers

PREFACE 


One  year  ago  I had  never  heard  of  latent  trait  theory,  an  item 
characteristic  curve,  or  Fred  Lord.  On  my  first  reading  of  Lord  and 
Novick  (1968)  Chapters  16  and  17,  I understood  absolutely  nothing. 

After several hours of study on my second reading, I finally comprehended
one simple equation. During the next several months I reread parts
of Lord and Novick as many as 20 times; taught myself some differential
calculus, integral calculus, mathematical statistics, probability
theory, and linear algebra; attended Fred Lord's course in Item
Response Theory at the Educational Testing Service, Princeton, NJ;
and read several publications on Item Response Theory.

I have now gotten to the point where I am able to use Item
Response Theory for my purposes, although there is still much that I
do not understand.

Upon  reflection,  I find  that,  as  is  true  in  many  sciences,  it  is 
not  necessary  to  fully  understand  the  theoretical  background  and 
mathematical  development  in  order  to  apply  the  results  of  the  model. 

It  is  widely  acknowledged  in  the  field  that  one  of  the  main 
reasons  that  item  response  theory  has  been  so  slow  to  catch  on  among 
testing  practitioners  is  the  mathematical  complexity  of  the  literature. 
Most  of  the  literature  is  written  with  language  and  notation  that  is 
standard  for  the  researchers.  However,  that  language  and  notation 
is  confusing  to  the  thousands  of  testing  practitioners,  whose  technical 
training  amounts  to  a couple  of  courses  in  statistics  and  tests  and 
measurement,  if  that  much.  On  the  other  hand,  many  of  the  concepts 
used  in  the  literature  are  not  difficult  to  understand,  if  explained 
in  less  esoteric  language  and  with  a few  examples. 




Therefore, it became my resolve that no testing practitioner, such
as I, should have to go through what I went through in order to
gain a basic understanding of item response theory. The purpose of
this paper is to fulfill that resolve.

Since  very  little  of  this  paper  is  original  with  me,  by 
rights  there  should  be  a reference  for  nearly  every  sentence  or 
paragraph.  Such  complete  references,  however,  will  not  be  included 
because  they  would  be  out  of  place  for  a primer,  and  usually  not  of 
interest  to  the  novice.  My  primary  references  are  Lord  & Novick  (1968) 
and  Lord  (in  preparation).  Some  references  will  be  included  to  direct 
the reader to more thorough and detailed explanations. Other references
will be included where authoritative support is deemed desirable.

A primer  is  necessarily  incomplete.  It  is  also  inaccurate  when 
it  contains  oversimplifications  which  apply  to  the  general  case,  but 
do  not  apply  to  extreme,  unusual,  or  uninteresting  cases.  This  paper 
will  be  guilty  of  such  generalities  and  rules  of  thumb. 

Other  excellent,  less  elementary  introductory  material  is  also 
available.  (See  Baker,  1977;  Hambleton  & Cook,  1977;  Sympson,  1977). 

I am  indebted  to  ENS  Debra  Cook,  ENS  Pamela  Crandall,  ENS  Charles 
Pastine,  and  LTJG  Larry  Young  for  their  assistance  in  the  analysis  of 
data. 


My appreciation for the many suggestions and corrections made by
the several readers and reviewers is gratefully acknowledged. They
are: John A. Burt, Joseph Cowan, Myron A. Fischl, Steven Gorman, Karen
Jones, Frederic M. Lord, James R. McBride, W. Alan Nicewander,
Malcolm J. Ree, and James B. Sympson.

I would also like to thank YN2 Ron Smith for his excellent art
work, and Jim Walls for his systems analysis and computer
programming.


THOMAS  A.  WARM 
January  22,  1978 
CHAPTER  1 
INTRODUCTION 


1.1  Item  Response  Theory  (IRT)  is  the  most  significant  development 
in  psychometrics  in  many  years.  It  is,  perhaps,  to  psychometrics 
what  Einstein's  relativity  theory  is  to  physics.  I do  not  doubt  that 
during  the  next  decade  it  will  sweep  the  field  of  psychometrics.  It 
has  been  said  that  IRT  allows  one  to  answer  any  question  about  an 
item  (test  question),  a test,  or  an  examinee,  that  one  is  entitled  to 
ask. Although this statement is somewhat circular, it will give you
an idea of the terrific power of IRT and of the mathematical
estimation methods involved.

The most common application of IRT is with multiple-choice
questions in an ability test. That use will be the thrust of this
paper, although IRT applies equally well to free-response (fill-in)
items. I make no distinction between ability and knowledge testing.
IRT  applies  to  tests  for  both.  Thus,  the  word  "ability"  will  be  used 
for  both  types  of  tests.  No  application  of  IRT  to  personality  or 
interest  testing  will  be  discussed. 

1.2  If  we  give  several  tests  in  the  same  subject  matter  area  to  a 
group  of  examinees,  we  find  that  in  general  the  same  examinees  score 
high  on  the  tests  and  the  same  examinees  score  low.  In  other  words, 
we  find  consistency  in  the  performance  of  examinees  on  the  different 
tests. 

To  explain  this  consistency  we  assume  that  there  is  something 
inside  the  examinees  that  causes  them  to  score  consistently.  We  call 
that  something  a mental  trait. 
In the vernacular the word "trait" implies an innate, inherited
characteristic.  We  don't  necessarily  mean  that.  We  mean  only  that 
characteristic  of  the  examinee  that  causes  consistent  performance  on 
the  tests,  whatever,  if  anything,  it  is. 

No one has found a physical referent for a mental trait, and few
really expect to. It is sometimes tempting to think of a trait as
having a physical referent like a brain engram, but that is always
unnecessary. In this sense, a trait is an intervening variable, as
opposed to a hypothetical construct. Since the mental trait has no
known physical referent, it is never observed directly. Therefore,
it is called a "latent" trait.

1.3 The scale of the latent trait is traditionally given the name of
the Greek letter theta (θ). I will use the terms theta, ability level,
amount of trait, and amount of subject-matter knowledge interchangeably.
Theta is a continuum from minus infinity (-∞) to plus infinity (+∞).
It has no natural zero point or unit. Therefore, the zero point and
unit are often taken as the mean and standard deviation, respectively,
of some reference sample of examinees. Thus, values of θ usually vary
from -3 to +3, but may be observed outside that range. The θs of a
sample need not be distributed normally.

1.4  When  an  examinee  walks  into  a testing  room,  he  brings  with  him  his 
theta.*  The  purpose  of  the  test,  then,  is  to  measure  the  relative 
position  of  the  examinees  on  the  theta  scale.  The  test  interprets  the 
examinee's  theta  and  produces  a measurement  of  ability,  which  is  often 
the  raw  (number  right)  score.  The  test  is  the  measuring  instrument. 

Often  measurement  of  an  ability  with  a test  is  made  analogous  to 
measurement  of  height  with  a tape  rule.  But  there  is  an  important 
difference.  Height,  whether  measured  by  an  English  rule  or  metric  rule, 
is  always  on  an  equal  interval  scale.  Histograms  of  a group  of  people 
will  always  look  the  same,  except  for  some  linear  stretching  of  a 
scale. 


*The  generic  masculine  pronouns  will  be  used  for  convenience. 
That is not the case with testing. The histograms of raw scores
of the same people on two tests will seldom look the same, even with
linear stretching of a scale. That is because each test has its own
peculiar scale (also called metric). The peculiarity of a test's
metric distorts the distribution of examinees. Until IRT there has
been no way to identify the peculiar scale of a test.
CHAPTER 2


Classical  Test  Theory  vs.  Item  Response  Theory 

2.1 Classical test theory has been developed over a period of many
years. Gulliksen (1950) gives an excellent presentation of classical
test theory.

Most  testing  practitioners  use  classical  test  theory,  whether  they 
know  it  or  not.  The  basic  tools  of  most  testing  practitioners  are: 

a. p-value = proportion of examinees selecting an item
alternative (also called "item difficulty"),

b. d-value = point-biserial correlation between the item
alternative and the test (some use the biserial correlation) (also called
"item discrimination"),

c.  mean  of  examinees'  (number  right)  scores, 

d.  standard  deviation  of  examinees'  scores, 

e.  skewness  and  kurtosis  of  examinees'  scores, 

f.  reliability  of  the  test,  usually  KR-20,  the  Kuder-Richardson 
Formula  20  (a  special  case  of  Cronbach's  coefficient  alpha). 

Anyone  whose  test  analysis  is  principally  based  on  the  statistics 
listed  above  is  using  classical  test  theory.  The  problem  with  those 
statistics  is  that  they  are  relative  to  the  characteristics  of  the  test 
and  of  the  examinees. 


The p-value is relative to the ability level of the examinees.
The same item given to a high-ability group and a low-ability group will
get two different p-values for the two groups. It can be shown that
p-values  are  not  true  measures  of  relative  item  difficulty.  It  is  not 
uncommon  for  items  measuring  the  same  ability  to  reverse  the  order  of 
their  p-values  when  given  to  groups  of  different  average  ability.  For 
example,  item  A may  have  a higher  p-value  than  item  B for  one  group  of 
examinees,  but  have  a lower  p-value  than  item  B for  a different  group. 
This  effect  is  not  a matter  of  sampling  error. 
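The reversal described above can be reproduced with a small numerical sketch. It assumes the three-parameter logistic model that this paper develops from Chapter 5 onward, and it uses two hypothetical items and two hypothetical groups of examinees that are not taken from any data set. The function names and all parameter values are illustrative assumptions.

```python
import math


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def p_value(thetas, a, b, c):
    """Classical p-value: mean proportion correct over a group of examinees."""
    return sum(p_correct(t, a, b, c) for t in thetas) / len(thetas)


# Hypothetical items: same b-value, very different a-values, no guessing.
ITEM_A = {"a": 2.0, "b": 0.0, "c": 0.0}
ITEM_B = {"a": 0.5, "b": 0.0, "c": 0.0}

# Hypothetical groups, given as fixed ability levels rather than random draws,
# so that sampling error plays no part in the result.
LOW_GROUP = [-2.0, -1.5, -1.0, -0.5, 0.0]
HIGH_GROUP = [0.0, 0.5, 1.0, 1.5, 2.0]

low_a = p_value(LOW_GROUP, **ITEM_A)
low_b = p_value(LOW_GROUP, **ITEM_B)
high_a = p_value(HIGH_GROUP, **ITEM_A)
high_b = p_value(HIGH_GROUP, **ITEM_B)

print(f"low group:  item A p = {low_a:.3f}, item B p = {low_b:.3f}")
print(f"high group: item A p = {high_a:.3f}, item B p = {high_b:.3f}")
```

In this sketch item A looks harder than item B in the low group and easier than item B in the high group, even though the item parameters never change.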

The d-value is relative to the homogeneity of the ability levels
of the examinees in the sample, the subject-matter homogeneity of the
items in the test, and the dispersion of p-values of items in the test.
The  same  item,  given  to  a group  of  examinees  who  are  similar  in  ability 
and  to  another  group  with  a wide  range  of  ability,  will  produce  two 
different  d-values  for  the  two  groups.  Similarly,  an  item  included  in 
a test  with  other  items  that  are  homogeneous  in  content  and  p-value 
will  get  a d-value  different  from  the  d-value  it  will  receive  in  a 
heterogeneous  test. 

The  mean,  standard  deviation,  skewness  and  kurtosis  will  also  vary 
according  to  the  characteristics  of  the  test  and  examinees. 

The reliability is relative to the standard deviation of the test,
and to the p-values and d-values of the items in the test, all of which
are dependent upon the particular abilities of the examinees and the
characteristics of the test.

The  following  quote  gives  another  liability  of  using  classical 
test  theory  in  culture-fair  testing  studies: 

"It  can  be  shown  that  classical  parameters  (e.g.  p-value)  will 
generally  not  be  linearly  related  across  subgroups  of  a population. 
This  means  that  the  test  for  cultural  bias  using  classical  parameters 
can lead to an artifactual detection of bias." (Pine, 1977, p. 40)
Clearly, classical test theory statistics are meaningful only in
an extremely limited situation, i.e., when the same item is given to
the same population as part of strictly parallel tests. Such a
situation rarely occurs. Furthermore, the basic precepts and definitions
of classical test theory are untestable, i.e., they are tautologies.
They are simply taken as true without any way to empirically determine
their relevance to reality. Some are assumed to be true even when this
does not appear to be warranted. Thus, no one knows if the classical
test model applies to any real test.

2.2  In  contrast  IRT  makes  possible  item  and  test  statistics  which  are 
dependent  neither  on  the  characteristics  of  the  examinees  nor  on  the 
other  items  in  the  test.  They  are  invariant.  With  the  item  statistics 
it  becomes  possible  to  describe  in  precise  terms  the  characteristics  of 
the  test  before  the  test  is  administered.  This  capability  allows  one  to 
construct  a test  that  is  highly  efficient  in  accomplishing  the  purpose 
of  the  test.  It  also  provides  an  extremely  powerful  tool  for  special 
studies,  such  as  item  cultural  bias. 

Moreover, the assumptions of IRT are explicit and have the
potential of empirical testing. It is possible to discover if the data
reasonably meet the assumptions.
CHAPTER 3


A Brief  History  of  Item  Response  Theory 

3.1  The  origin  of  latent  trait  theory  can  be  traced  to  Ferguson  (1942) 
and  Lawley  (1943).  Item  Response  Theory  is  just  one  of  several  models 
under  latent  trait  theory.  The  Rasch  model  is  another. 

3.2 Other early publications using some of the same concepts are
Brogden (1946), Tucker (1946), Carroll (1950), and Cronbach and
Warrington (1952).

3.3 In 1952, Lord published his Ph.D. dissertation in which he
presented IRT as a model or theory in its own right. At that time he
called it Item Characteristic Curve Theory. Thus, Lord is considered
the father and founder of IRT. Shortly after publishing his
dissertation, Lord stopped work on IRT for ten years, due to a seemingly
intractable problem with it.*

3.4  In  1960,  Rasch  (1960)  published  his  one-parameter  sample-free 
model.  The  Rasch  model  stirred  much  interest  and  considerable  work  was 
done  on  it  during  the  next  decade.  Its  leading  proponent  in  the  U.S. 
is  Benjamin  Wright,  a psychoanalyst  at  the  University  of  Chicago.  (See 
Wright,  1977  for  references). 


3.5  In  1965,  Lord  (1965)  conducted  a massive  study,  using  a sample 
size  of  greater  than  100,000.  That  study  showed  that  the  "problem", 
which  had  deterred  his  work  for  so  long,  was  not  really  a problem,  and 
that  IRT  was  appropriate  for  real  life  multiple-choice  tests.  With 
that  study  Lord  began  work  again  on  IRT. 


*This  problem  is  discussed  in  Section  14.2 
3.6 In 1968, Lord and Novick published a psychometrics textbook,
within which were four chapters (17-20) by Allan Birnbaum (1968), a
well-known statistician (now deceased). Birnbaum's chapters worked out
in detail the mathematics of the two- and three-parameter normal ogive
and logistic models.*

3.7 Soon thereafter Urry (1970) completed his Ph.D. dissertation in
which he compared the one-, two-, and three-parameter models. He
concluded that the three-parameter model best described the real world for
multiple-choice tests.

3.8 Since Urry's dissertation, much work has been done on all three
models (i.e., one, two, and three parameter), but the three-parameter
model is now receiving most of the attention because it best describes
reality. Accordingly, I shall deal with the 3-parameter model only.

3.9 Much of the work on the 3-parameter model is coming from 3
principal sources. The sources are:

a. Frederic M. Lord, Distinguished Research Scientist,
Educational Testing Service, Princeton, NJ.

b.  Vern  W.  Urry,  Personnel  Research  Psychologist,  United  States 
Civil  Service  Commission,  Washington,  D.C. 

c. David J. Weiss, Prof. of Psychology, Psychometric Methods
Program,  University  of  Minnesota,  Minneapolis,  MN. 

There  are,  of  course,  many  other  highly  productive  researchers 
publishing  excellent  studies.  Failure  to  include  them  in  this  list  is 
more  an  indication  of  my  limited  exposure  than  of  the  significance  of 
their  contributions. 


*The  normal  ogive  and  logistic  ogive  will  be  compared  briefly  in 
Chapter  4. 



3.10 The United States Civil Service Commission has adopted a
particular application of IRT as official policy. The five U.S. armed
forces  (including  the  U.  S.  Coast  Guard)  are  also  investigating  the 
application  of  IRT. 

3.11 In 1977 Lord changed the name of his model from Item
Characteristic Curve Theory to Item Response Theory.


CHAPTER  4 


The  Normal  Ogive  and  Logistic  Ogive 

4.1 I trust the reader will recognize the normal curve plotted in
Figure 4.1 with the pluses (++++). It has a mean = 0 and a standard
deviation = 1. The formula for this normal curve is identified in
Figure 4.1 as N(0,1).

4.2 A bell-shaped curve like this is called a frequency function
(f.f.). It is called a frequency function whether the ordinate
(vertical axis) is defined as frequency, proportion, percent, or
density (Kendall and Stuart, 1977, p. 13). Therefore, we call the
normal curve the "normal frequency function."

4.3 Superimposed over the normal f.f. in Figure 4.1 is a logistic*
curve or logistic frequency function, plotted with dots (....).
This logistic f.f. also has a mean = 0 and a standard deviation of
approximately 1.0 (about 1.07).
The formula for this logistic f.f. is identified in Figure 4.1 as
L(0,1.7). The 1.7 in the exponent of the formula is chosen to allow
the logistic f.f. to approximate the normal f.f. as closely as possible.
The actual value is 1.6679, which is rounded to 1.7. In some of the
literature the 1.7 is represented by the upper case letter D. The
letter e is the base of natural logarithms; e ≈ 2.718281828.
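The two frequency functions can be written out and compared numerically. The sketch below assumes the standard normal density for N(0,1) and the usual logistic density with 1.7 in the exponent for L(0,1.7); the formulas printed in Figure 4.1 are not reproduced in the text, so both function definitions and their names are assumptions made here for illustration.

```python
import math

D = 1.7  # scaling constant in the exponent of the logistic f.f.


def normal_ff(z):
    """Normal frequency function N(0,1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logistic_ff(z, d=D):
    """Logistic frequency function L(0,1.7); symmetric about z = 0."""
    e = math.exp(-d * abs(z))  # abs() keeps the exponent non-positive
    return d * e / (1.0 + e) ** 2


def area(f, lo=-12.0, hi=12.0, steps=24000):
    """Area under f between lo and hi by the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


print(f"height at z = 0: normal {normal_ff(0.0):.4f}, logistic {logistic_ff(0.0):.4f}")
print(f"area under curve: normal {area(normal_ff):.6f}, logistic {area(logistic_ff):.6f}")
```

Both curves enclose a total area of 1, and the logistic curve is slightly taller at its center than the normal curve.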

4.4  The  reader  will  also  recognize  the  S-shaped  curve  in  Figure  4.4 
as  the  normal  cumulative  frequency  curve.  An  S-shaped  curve  is 
called  an  ogive.**  This  curve  gives  the  proportion  of  area  under  the 
normal  curve  (Figure  4.1)  that  lies  to  the  left  of  each  point  on  the 
abscissa  (horizontal  axis). 

*pronounced  lojistic 

**pronounced  ojive 


[Figure 4.1. N(0,1) and L(0,1.7) frequency functions.]
4.5 An ogive like this is called a distribution function (d.f.). It
is called a distribution function whether the ordinate is defined as
cumulative frequency, cumulative proportion, cumulative percent, or
cumulative area (Kendall & Stuart, 1977, p. 13). Therefore, we call the
curve in Figure 4.4 a "normal distribution function," or a "normal
ogive". The formula for this normal d.f. is identified in Figure 4.4
as ∫N(0,1).

4.6 Also in Figure 4.4, but not discernible, is the logistic ogive
(or logistic d.f.) for the logistic f.f. in Figure 4.1. It is not
discernible, because it is so close to the normal ogive that on this
scale the two curves merge together in the width of the ink line. A
small portion has been magnified to a larger scale (10x), so that the
difference may be seen. The magnified area was chosen at the place
where the 2 ogives are farthest apart. The reader can verify that at
any point on the abscissa the 2 ogives are always less than .01 apart
on the ordinate, as is indicated by the inequality under the
magnification in Figure 4.4. The formula for this logistic d.f. is
identified in Figure 4.4 as ∫L(0,1.7).
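The claim that the two ogives are never more than .01 apart can be checked numerically. The sketch below is one way to do so, assuming the normal ogive is computed from the error function and the logistic ogive from its closed form with 1.7 in the exponent; the function names and the search grid are illustrative choices, not part of the original figures.

```python
import math


def normal_ogive(z):
    """Normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_ogive(z, d=1.7):
    """Logistic distribution function with scaling constant d."""
    return 1.0 / (1.0 + math.exp(-d * z))


# Search a fine grid from z = -6 to z = +6 for the largest vertical gap.
grid = [i / 1000.0 for i in range(-6000, 6001)]
max_gap, z_at_max = max(
    (abs(normal_ogive(z) - logistic_ogive(z)), z) for z in grid
)

print(f"largest gap between the ogives: {max_gap:.4f} at z = {z_at_max:.2f}")
```

On this grid the largest gap stays below .01, which agrees with the inequality cited for Figure 4.4.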

4.7 The ogive with which we are concerned is the normal ogive.
However, note the integral sign (∫) on the right side of the
definition of ∫N(0,1).

The integral sign there means that no algebraic function can be
found to describe the normal ogive. This fact makes the normal ogive
very cumbersome to work with mathematically, and requires numerical
methods to solve, or a table of values.
4.8 On the other hand the logistic ogive has no integral sign on the
right side of its definition (∫L(0,1.7)). In fact, the expression
on the right in Figure 4.4 is the algebraic function describing the
logistic ogive. The logistic ogive is very easy to work with.*

4.9 For these reasons the logistic ogive is substituted as a
convenient and very close approximation to the normal ogive.

4.10 This paper will deal only with the logistic ogive. Statements
about the logistic ogive may be taken as close approximations to the
normal ogive model. The logistic f.f. is no longer of interest to us.


*Some  interesting  logistic  identities  are  given  in  Appendix  A. 


CHAPTER  5 

More  About  Logistic  Ogives 

5.1  Figure  4.4  shows  just  one  logistic  ogive.  There  is  actually  an 
infinite  family  of  logistic  (and  normal)  ogives,  each  different  in 
some  way  from  every  other  one. 

5.2  Logistic  ogives  are  strictly  monotonic  functions.  They  are 
strictly  monotonic  because,  going  from  left  to  right,  the  ogive 
always  gets  higher  and  higher,  never  is  completely  horizontal,  and 
never  goes  down. 

5.3  Notice  the  ogive  in  Figure  4.4.  Between  -2.0  and  -0.5  on  the 
horizontal  axis  the  ogive  is  concave  upward.  Between  0.5  and  2.0  it 
is  concave  downward.  At  some  point  between  -0.5  and  0.5  this  ogive 
must  change  from  being  concave  upward  to  concave  downward.  That 
point  is  called  the  "inflection  point."  The  inflection  point  is 
always  the  point  where  the  slope  of  the  ogive  is  at  its  maximum.  The 
inflection  point  for  this  ogive  is  located  on  the  vertical  axis  at 
.50,  and  on  the  horizontal  axis  at  0.0. 
5.4 Three-parameter logistic ogives (with which we are exclusively
concerned)  may  differ  from  each  other  in  only  3 ways,  one  for  each 
parameter. 


5.5  One  way  in  which  logistic  ogives  may  differ  is  in  the  horizontal 
location  of  the  inflection  point.  Figure  5.5  shows  3 logistic  ogives 
labeled  E,  F,  and  G with  their  inflection  points  at  different  places  on 
the  abscissa.  You  can  see  that  the  3 ogives  are  exactly  the  same 
except for a sideways shift of the entire curve. Shifting the
inflection point sideways shifts the entire ogive sideways. The horizontal
position of the inflection point is called the "b-parameter". Some
call it, as we will, the "b-value". The b-values of ogives E, F, and G
in Figure 5.5 are -.5, 0.0, and 1.0, respectively.

5.6  To  include  the  b-parameter  in  the  logistic  ogive  function,  it  is 
only  necessary  to  subtract  the  b-parameter  from  the  horizontal  axis 
variable. 

5.7 Figures 4.1, 4.4, and 5.5 were constructed with the horizontal
axis labeled z. This label was chosen to facilitate understanding of
the logistic f.f. and d.f., because of the reader's likely familiarity
with the traditional z-scores of measurement. Since we are concerned
with the ability scale called θ, we now and hereafter label the
horizontal axis θ. Substituting θ for z in the logistic function
and subtracting the b-parameter gives the height of the logistic
ogive by the function

$$\Psi(\theta) = [1 + e^{-1.7(\theta - b)}]^{-1}$$
which is sometimes written

$$\Psi(\theta) = [1 + \exp(-1.7(\theta - b))]^{-1}$$

where exp means e raised to the power of whatever is in the
parentheses after the exp. The upper case Greek letter psi (Ψ) is used
in the literature to mean the logistic ogive. Phi (Φ) is used to
mean the normal ogive.
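The logistic ogive with a b-parameter translates directly into a few lines of code. The sketch below evaluates it for the three b-values given for ogives E, F, and G in Figure 5.5; the function name `psi` and the printed ability levels are illustrative choices.

```python
import math


def psi(theta, b=0.0):
    """Height of the logistic ogive at ability theta for a given b-value."""
    return 1.0 / (1.0 + math.exp(-1.7 * (theta - b)))


# b-values of ogives E, F, and G as stated for Figure 5.5.
for label, b in (("E", -0.5), ("F", 0.0), ("G", 1.0)):
    heights = ", ".join(f"{psi(theta, b):.3f}" for theta in (-1.0, 0.0, 1.0))
    print(f"ogive {label} (b = {b:+.1f}): heights at theta = -1, 0, 1 -> {heights}")
    print(f"  height at its own b-value: {psi(b, b):.2f}")
```

Each ogive has a height of .50 at its own b-value, and changing b only slides the curve sideways: the height at θ for b-value b equals the height at θ - b for a b-value of zero.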

5.8 The logistic ogive has 2 asymptotes. The asymptotes are horizontal
lines that the ogive approaches at its extremes, but never quite
reaches. The upper asymptote is located on the vertical axis at
1.00. In Figures 4.4 and 5.5 you can see that the upper, right part
of each logistic ogive approaches the value of 1.00 on the vertical
axis. In the figures it may appear as though they touch the
horizontal line at 1.00, but, strictly speaking, they never quite do.

5.9 The lower asymptote for the ogives in Figures 4.4 and 5.5 is
the horizontal axis with a height of zero. Just as the upper part of
the ogive never quite reaches 1.00, the lower part of the ogive never
quite reaches the lower asymptote.

5.10  All  logistic  ogives  in  IRT  have  an  upper  asymptote  at  1.00,  but 
not  all  have  a lower  asymptote  at  .00.  In  fact,  few  do. 

5.11  Figure  5.11  shows  3 logistic  ogives,  labeled  H,  J,  and  K,  which 
are  identical  except  for  different  lower  asymptotes.  The  lower 
asymptotes  are  at  .15,  .25,  and  .30  on  the  vertical  axis.  The 
b-value  for  each  ogive  = 0.0.  Note  that  the  upper  asymptote  for  all 
3 ogives  is  at  1.00. 

5.12 Note also that the inflection points (all located at 0.0 on the
θ scale) for the ogives in Figure 5.11 are at different heights. In
fact,  they  are  half-way  between  their  asymptotes.  That  is  always  the 
case.  The  inflection  point  of  the  logistic  ogive  is  always  half-way 
between  its  upper  and  lower  asymptotes. 
5.13 The lower asymptote is called the c-parameter or the c-value. It
is another of the 3 parameters of IRT.

5.14 The effect of the c-value is to squeeze the ogive into a smaller
vertical range. The reduced range is equal to 1 - c. The effect of
the reduced vertical range is to reduce the slope of the ogive at every
point on the θ scale, other things being equal. We include the
c-parameter in the logistic function by multiplying by 1 - c, and
adding c:

$$
\Psi(\theta) = c + (1 - c)\left[1 + e^{-1.7(\theta - b)}\right]^{-1}
$$

which is the same as

$$
\Psi(\theta) = c + \frac{1 - c}{1 + e^{-1.7(\theta - b)}}
$$

5.15 The third (and last) parameter of IRT is (you guessed it) the
a-parameter, or a-value.


5.16 The a-parameter is related to the slope of either ogive at
the inflection point, or in other words at the b-value. For the normal
ogive model (with c = 0.0)

$$
a = \sqrt{2\pi}\, m \approx 2.5\,m
$$

where m is the slope of the ogive at the b-value.


5.17 Figure 5.17 shows 3 logistic ogives (L, M, & N), which are identical
except for their a-values = .3, .8 and 2.0, respectively, with b = 0.0
and c = .00. As you can see, the larger the a-value, the steeper the
ogive. Specifically,

$$
a = \left[\Psi^{-1}(A) - b\right]^{-1}
$$

where Ψ⁻¹(A) = the point on θ where the height of the ogive = c + .8455(1 - c).
The -1 that looks like an exponent of Ψ is not an exponent at all,
but indicates the inverse of the function. Typically, a function is
used by starting at some point on the abscissa, going vertically to the
function, and then horizontally to the ordinate. The inverse procedure
would be to start at a point on the ordinate (in this case at c +
.8455(1 - c)), go horizontally to the function, and then drop down to the
abscissa (θ). That point on θ is Ψ⁻¹(A). The -1 outside the brackets
is an exponent, which means to take the reciprocal. The number .8455
is the proportion of area under the logistic f.f. and to the left
of z-score = 1 (see Figure 4.1). The z-score = 1 is an arbitrary
mathematically convenient point.
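
The inverse procedure described above can be sketched in Python. This
is only an illustration under stated assumptions: it uses the
3-parameter form of the logistic ogive given in 5.18, and the function
names (ogive, inverse_ogive, a_from_ogive) are hypothetical, chosen
for this example.

```python
import math

D = 1.7  # scaling constant used with the logistic ogive in this primer


def ogive(theta, a, b, c):
    """3-parameter logistic ogive: c + (1 - c) / (1 + e^(-1.7a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def inverse_ogive(height, a, b, c):
    """Return the theta at which the ogive reaches `height` (c < height < 1)."""
    p = (height - c) / (1.0 - c)  # rescale the height to the 0..1 range
    return b + math.log(p / (1.0 - p)) / (D * a)


def a_from_ogive(a_true, b, c):
    """Recover the a-value as the reciprocal of (inverse ogive of A, minus b)."""
    base = 1.0 / (1.0 + math.exp(-D))  # about .8455, the logistic value at z = 1
    A = c + base * (1.0 - c)           # height at which the inverse is taken
    return 1.0 / (inverse_ogive(A, a_true, b, c) - b)


# Ogive N of Figure 5.17 has a = 2.0, b = 0.0, c = .00.
recovered = a_from_ogive(2.0, 0.0, 0.0)
```

Starting from an ogive with a known a-value, the function reads off the
point on θ where the height is c + .8455(1 - c) and takes the
reciprocal of its distance from b, which returns the same a-value.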

5.18 The a-parameter enters the logistic function as part of the
exponent of e.

$$
\Psi(\theta) = c + \frac{1 - c}{1 + e^{-1.7a(\theta - b)}}
$$

This formula is the 3-parameter logistic ogive. It will look
rather ominous to the novice. However, it is not difficult with a
pocket calculator with an e^x key and a 1/x key. It is highly
instructive to go through the calculation of several points of a typical
logistic ogive and to plot them. An opportunity to do so is provided
below for an ogive with a = .9, b = -.4, and c = .2. The reader can
verify the results in Figure 5.18, which shows this logistic ogive with
its characteristic parts labeled.

Pocket Calculator Instructions

a = .9, b = -.4, c = .2

$$
\Psi(\theta) = c + \frac{1 - c}{1 + e^{-1.7a(\theta - b)}}
$$

Enter     Key     Comment
θ                 (pick one)
          -       minus
-.4               b
          X       times
.9                a
          X       times
-1.7              constant
          e^x     e^(-1.7a(θ-b))
          +       plus
1                 constant
          1/X     reciprocal

The reciprocal is the fraction 1/(1 + e^(-1.7a(θ-b))). To finish,
multiply it by (1 - c) = .8 and add c = .2.

Record your Ψ(θ) here

   θ      Ψ(θ)
   3      ____
   2.5    ____
   2      ____
   1.5    ____
   1      .916
   .5     .839
   0      ____
  -.5     .569
  -1      ____
  -1.5    ____
  -2      ____
  -2.5    ____
  -3      ____

Now plot Ψ(θ) vs. θ.
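
The same calculation can be sketched in Python for readers who prefer
a computer to a pocket calculator. This is a minimal illustration, not
part of the original instructions; the function name psi and the
variable names are assumptions made for this example.

```python
import math


def psi(theta, a=0.9, b=-0.4, c=0.2):
    """3-parameter logistic ogive for the worked example (a = .9, b = -.4, c = .2)."""
    exponent = -1.7 * a * (theta - b)           # the value passed to the e^x key
    reciprocal = 1.0 / (1.0 + math.exp(exponent))
    return c + (1.0 - c) * reciprocal           # multiply by (1 - c), then add c


# The thirteen theta values listed in the recording table.
thetas = [3, 2.5, 2, 1.5, 1, 0.5, 0, -0.5, -1, -1.5, -2, -2.5, -3]
table = {t: round(psi(t), 3) for t in thetas}

for t in thetas:
    print(f"{t:5.1f}  {table[t]:.3f}")
```

The three entries already filled in on the recording table (.916 at
θ = 1, .839 at θ = .5, and .569 at θ = -.5) can be used to check the
output.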


CHAPTER  6 

The  Item  Response  Function  (IRF) 


6.1 Let's consider 2 examinees (Al and Bob) with different ability
levels, i.e. different θs. Let's say Al has a higher θ than Bob. That
means they are located at different places on the θ scale. See Figure
6.1.

6.2 What are the chances that Al will get item #1 correct? What are
the chances that Bob will get item #1 correct? So far we don't know
the answer to either of those questions. But we do know one thing. Al
has a better chance of getting item #1 correct than Bob, because Al is
smarter than Bob (in ability θ). So let's represent the probability of
each getting the item correct by a point above each (points A & B) in
Figure 6.2.


6.3 In doing so we have defined an ordinate as the probability of
getting the item correct as a function of θ (ability). This may be
written Pᵢ(R|θ), and read, "the Probability of getting item i correct
given (|) θ." But for brevity it is usually written Pᵢ(θ). The
subscript (i) is often omitted.

6.4 Now let's take Carl, who is dumber (less ability θ) than Bob.
Carl has an even smaller chance of getting the item correct. See
Figure 6.4a.

Figure 6.4a. The probabilities of Al, Bob, and Carl
getting Item #1 correct.

And let's also add Dave, and Ed and Fred who have less θ still. See
Figure 6.4b.


6.5 Since the probability of getting the item correct is only a
function of the amount of ability,* we can say that anyone who has
the same θ as Al will have the same probability as Al of getting
the item correct (A). And everyone who has the same θ as Ed will
have the same probability as Ed of getting the item correct (E),
and so on. Therefore, we can connect the points in Figure 6.4c,
which will tell us the P(θ) for each θ. This curve is called the Item
Response Function (IRF) and was until recently called the Item
Characteristic Curve (ICC). See Figure 6.5.


Figure 6.5. The Item Response Function of Item #1.


6.6  We  know  several  things  about  this  IRF. 


(1) It cannot rise higher than 1.0, because a probability = 1.0
is a sure thing, and nothing can be more probable than a sure thing.

(2) It will never reach a height of 1.0, because in testing there
is no such thing as a sure thing. Therefore, the curve has an upper
asymptote of 1.00.

(3) Between Ed and Bob the curve has to rise rapidly, because it
must rise from point E to point B in the short distance between Ed's
θ and Bob's θ.


*assuming unidimensionality, which will be discussed in Section 14.4.


(4)  The  curve  must  always  rise  (i.e.  can  never  be  horizontal  or 
go  down)  as  we  move  from  left  to  right,  because  as  ability  increases, 
so  does  the  probability  of  getting  the  item  correct.  Therefore,  the 
curve  is  strictly  monotonic. 

(5) It cannot go below 0.00, because a probability = 0.00 is an
absolute impossibility, and nothing can be less probable than an
absolute impossibility. Therefore, the curve has a lower asymptote.

(6) Since the item is a multiple-choice question, there is
usually a fair probability of getting the item correct strictly by
chance alone, no matter how low the θ. Traditionally, we have taken
this probability to be 1/A, where A = the number of alternatives in the
multiple-choice question. A 4-choice item has been thought to have a
chance probability of 1/4 = .25, and a 5-choice item, a chance
probability of 1/5 = .20. Whatever the chance probability of getting
a multiple-choice item correct is, it is not expected to be zero.
It is expected to be somewhat greater than zero. Therefore, the curve
in Figure 6.5 is expected to have a lower asymptote above zero. (In
Section 7.3 we shall see that the lower asymptote is seldom 1/A.)

6.7 You have probably noticed that all of the things we observed about
the IRF are also true about the 3-parameter normal ogive and logistic
ogive.

Therefore, we conclude that the normal (or logistic) ogive may be
used to describe the IRF very well. And we may use the logistic ogive
function to describe the IRF mathematically.

6.8 If somehow we knew the probabilities of getting item #2 correct
for Al, Bob, Carl, Dave, Ed, Fred, and Olga, and we were to plot them,
we might get an IRF like Figure 6.8.


6.9 Figure 6.9 shows both item #1 and item #2.

For Olga, Ed and Fred (and anyone else with their θs) the probability
(P₂(θ)) of getting item #2 correct is about the same as their P₁(θ) for
item #1.

But item #2 is harder for Al, Bob, Carl, and Dave than item #1,
because for all of them the probability of getting item #2 correct
(P₂(θ)) is lower than the probability of getting item #1 correct. And
it would be harder for anyone who has the same ability as Al, Bob,
Carl, or Dave.

6.10  We  also  notice  that  the  probabilities  of  getting  item  #2  correct 
for  Bob,  Carl,  Dave,  Ed  and  Fred  are  all  about  the  same.  Item  #2, 
then,  does  not  do  a good  job  in  distinguishing  among  people  with 
abilities  like  Bob's  or  below.  This  observation  is  consistent  with 
what  we  intuitively  understand  about  items.  A hard  item  does  not 
discriminate  among  low  ability  people,  because  they  all  get  it  wrong 
(unless  they  make  a lucky  guess).  An  easy  item  does  not  distinguish 
among  high  ability  people,  because  they  all  get  it  correct.  A test 
composed of items with IRFs like item #2's IRF would not be a good test
for  measuring  the  relative  ability  of  people  like  Bob,  Carl,  Dave,  Ed 
and  Fred. 

Note: In practice, any particular examinee may either know the answer
to a particular item (in which case his probability of getting it
correct is 1.00), or not know it (in which case his probability of
getting it correct is chance). Strictly speaking, we cannot talk about
the probability of a particular person getting a particular item
correct. However, for pedagogical reasons we will violate this
restriction in this section. (See Section 8.2 for clarification.)

6.11 However, Olga's P₂(θ) for item #2 is much higher than Al's
P₂(θ). Therefore, item #2 will distinguish between people like Al and
Olga. If a distinction in that range of ability is our purpose, then
a test made of items like #2 would be a pretty good test.


6.12 Item #3 might have an IRF like that in Figure 6.12. This IRF
rises over a longer range than does that of either item #1 or item #2,
but its slope is less at every point during its rise. This low slope
means that item #3 is discriminating over a wide range of θ, but is not
doing so well at any particular θ.


Figure 6.12. The IRF of Item #3. (Horizontal axis: θ, with Fred, Ed,
Dave, Carl, Bob, and Al marked from left to right.)


6.13 Figure 6.13 shows the IRFs for both item #1 and item #3.

It is interesting to note that item #3 is harder than item #1 for
Al and Bob, but easier for Dave, Ed, and Fred. This possibility of
reversed relative item difficulty for persons of different ability is
one of the surprising results of IRT.
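
The reversal can be illustrated numerically. The sketch below uses two
hypothetical items (the parameter values are invented for this example
and are not those of items #1 and #3) whose logistic IRFs differ only
in slope, so the curves cross.

```python
import math


def irf(theta, a, b, c):
    """3-parameter logistic IRF: c + (1 - c) / (1 + e^(-1.7a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


# Hypothetical parameters: same b and c, different a-values.
steep_item = {"a": 1.5, "b": 0.0, "c": 0.2}
flat_item = {"a": 0.4, "b": 0.0, "c": 0.2}

high_theta, low_theta = 1.5, -1.5

# For the able examinee the flat item is the harder one ...
p_high = (irf(high_theta, **steep_item), irf(high_theta, **flat_item))
# ... while for the less able examinee the flat item is the easier one.
p_low = (irf(low_theta, **steep_item), irf(low_theta, **flat_item))

print(p_high, p_low)
```

Which item counts as "harder" therefore depends on where the examinee
sits on the θ scale, as the paragraph above describes.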

6.14  We  have  seen  that  the  greater  the  slope  of  the  IRF,  the  greater 
the  discrimination,  but  the  smaller  the  range  of  discrimination.  We 
have  already  noted  in  Chapter  5 that  the  a-parameter  of  the  logistic 
ogive  describes  its  slope.  Therefore,  the  a-value  is  called  the 
discrimination  index  of  the  IRF.  The  greater  the  a-value  of  the  IRF, 
the  better  the  item  discriminates. 

6.15  Also  apparent  is  the  fact  that  the  shift  of  the  IRF  as  a whole 
to  the  left  makes  the  item  easier  in  general,  and  to  the  right  makes 
the  item  harder  in  general.  The  left-right  shift  of  the  logistic  ogive 
is  described  by  the  b-parameter.  Thus,  the  b-value  is  the  difficulty 
index  of  the  IRF.  The  more  difficult  the  item  is,  the  larger  (in  the 
positive  direction)  the  b-value  of  the  IRF. 

6.16 The IRFs of items 1, 2, and 3 have different lower asymptotes.
Since the IRF never goes below the lower asymptote, this difference in
IRFs means that the items are of different difficulty even for
examinees of very low ability. But examinees of very low ability will
know almost nothing about the item, and therefore have to guess. The
difference in lower asymptotes of IRFs means that very low ability
examinees have a better chance of guessing the correct choice of some
items than of others. This result of IRT will be discussed further in
Section 7.3. The lower asymptote of the logistic ogive is the
c-parameter. The c-value of an IRF is called the "guessing index" or
more properly the "pseudo-guessing index" of the item. Both terms are
used.

Figure 6.17. The IRFs of four actual items from the
Coast Guard Knowledge section of the U.S. Coast
Guard Warrant Officer Test, series 8.


6.17 Figure 6.17 shows the IRFs for 4 actual items from the Coast
Guard Knowledge section of the U.S. Coast Guard Warrant Officer test.
Item #17 is a very difficult, but highly discriminating item. It has a
c-value of .00, which means that nearly all examinees below θ = 1
answered the item incorrectly. Item #17 is a very unusual item in two
respects: its extremely high a-value, and its .00 c-value. It is,
however, an ideal item for many purposes.

Item  #21  is  an  easy  item  with  somewhat  low  discrimination.  Item 
#47  is  slightly  easier  than  #21,  but  has  good  discrimination.  Item  #50 
is  an  item  with  medium  difficulty,  and  poor  discrimination. 

6.18 The IRF should not be confused with the item-test curve. The
item-test curve has raw score as the horizontal axis instead of θ.
The item-test curve, therefore, suffers from the same problems of
distorted scale as the raw score. The item-test curve has no
particular shape, and is not independent of the other items in the test.
In fact, the average of the item-test curves of all items in a test is
always a straight line of slope = 1 (i.e. 45°). Thus, for many purposes
the item-test curve is useless as an analytic tool.
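
The straight-line property can be checked with a short sketch. The
response data below are randomly generated and purely hypothetical,
and the function name is an assumption made for this example. Both
axes are expressed as proportions (raw score divided by the number of
items), which is what gives the 45° line.

```python
import random

random.seed(0)
N_ITEMS, N_PEOPLE = 10, 500

# Hypothetical 0/1 response matrix; any dichotomously scored data would do.
responses = [[random.randint(0, 1) for _ in range(N_ITEMS)]
             for _ in range(N_PEOPLE)]


def average_item_test_curve(rows):
    """Map each observed raw score to the mean, over items, of the
    proportion of examinees with that raw score who got the item correct."""
    n_items = len(rows[0])
    groups = {}
    for row in rows:
        groups.setdefault(sum(row), []).append(row)
    curve = {}
    for raw, members in groups.items():
        item_props = [sum(r[i] for r in members) / len(members)
                      for i in range(n_items)]
        curve[raw] = sum(item_props) / n_items
    return curve


curve = average_item_test_curve(responses)
for raw in sorted(curve):
    print(raw / N_ITEMS, round(curve[raw], 3))
```

Within each raw-score group the item proportions must add up to the
raw score itself, so their average always equals raw score divided by
the number of items, whatever the items are like.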


CHAPTER 7

The a, b, & c parameters


7.1 The a-value is the discrimination index of the item. If θ is
normally distributed, in the normal ogive model the a-value is related
to the d-value in the following very complex way (from Schmidt, 1977).

$$
a \approx \frac{d\sqrt{pq}}{\sqrt{(\text{KR-20})(1 - c)^2 y^2 - d^2 pq}}
$$

where d = d-value, the point biserial item-test correlation

p = p-value, the proportion of examinees correctly answering the item

q = 1 - p

KR-20 = Kuder-Richardson formula 20 reliability

y = the height of the N(0,1) curve at the z-score that cuts off the
p′ proportion of the area under the N(0,1) frequency function

c = c-value

p′ = (p - c)/(1 - c)

The a-value is related to the slope of the IRF, and can range from
0.0 to +∞ just as the slope can. Negative slopes are possible, but
not of interest to us. Experience has shown that a-values of typical
items vary from about .5 to 2.5 with most from 1.0 to 2.0. The highest
I have observed is 3.76. An item with a low a-value discriminates
poorly over a wide range of θ. With a high a-value the item
discriminates well, but over a small range of θ. Items with a-values
below .80 are not very good items for most purposes.

7.2 The b-value is the difficulty index. If θ is normally distributed,
it is related to the p-value in the normal ogive model (from Schmidt,
1977) in the following way:

$$
b \approx \frac{y z (1 - c)\sqrt{\text{KR-20}}}{d\sqrt{pq}}
$$

where z = the z-score that cuts off the p′ proportion in the upper
portion of the area under the N(0,1) frequency function, and the other
symbols are as defined in Section 7.1 above. Typical b-values range
from -2.5 to +2.5. A b-value of -2.5 indicates the item is very easy.
An item with a +2.5 b-value is very difficult, and items with 0.0
b-values are of medium difficulty.

7.3 The c-value is the guessing parameter or pseudo-guessing
parameter. It indicates the probability of examinees with very low
ability of getting the item correct. Most c-values range from .00 to
.40. Items with c-values of .30 or greater are not very good items.
It is desirable to have the c-value at .20 or less. The lower the
c-value is, the better. A zero c-value is ideal. Typically, the
c-value is about 1/A - .05, where A = the # of alternatives. Thus,
4-choice items often have c ≈ .20 (i.e. .25 - .05), and 5-choice items
often have c ≈ .15 (i.e. .20 - .05).

Items  do  not  have  a c-value  of  1/A  because  examinees  do  not,  in 
fact,  guess  randomly  when  they  do  not  know  the  answer  (as  has  often 
been  assumed  in  classical  test  theory  analyses). 

7.4 Two explanations have been offered for the fact of non-random
guessing (c ≠ 1/A).

Lord has suggested that item writers are very clever in writing
distractors that are very attractive to low ability examinees. Thus,
when low-θ examinees do not know the answer they are attracted more to
distractors than to the correct answer, and so get the item wrong more
often than if they guessed randomly.

The  other  explanation  is  my  own,  based  upon  personal  knowledge  of 
item  writing  and  test  taking  behavior: 


(1)  When  an  item  writer  sits  down  to  write  items,  he,  for  the 
moment,  is  not  concerned  with  the  distribution  of  the  correct  answers 
(the  keyed  choices)  among  the  four  (for  four-choice  items)  possible 
positions  (i.e.  choice  A,  choice  B,  choice  C,  and  choice  D). 

(2) He has a tendency to try to hide the correct choice. In a
four-choice item there are only 2 places to hide it: choice B or
choice C. Therefore, he writes many more items keyed B or C than A
or D, and in fact there seems to be a much stronger tendency toward C.
(I have verified this tendency with many item writers.) This also
seems to be true for 5-choice items.

(3)  When  he  finishes  writing  the  items,  he  tabulates  the  numbers 
of  items  keyed  for  each  position,  and  usually  finds  that  he  has  many 
more  C's  than  A's,  B's,  or  D's  (or  E's  in  5-choice  items). 

(4)  Most  testing  organizations  have  a requirement  that  there 
should  be  about  equal  numbers  of  items  with  the  keyed  choice  in  each 
of  the  4 or  5 possible  positions. 

(5)  The  item  writer  then  begins  to  revise  the  order  of  the 
choices  in  items  to  decrease  the  number  of  items  keyed  C,  and  increase 
the  number  of  items  keyed  A and  D and  maybe  B.  He  continues  to  revise 
the  order  of  the  choices  of  items  until  he  has  satisfied  the  require- 
ment of  about  equal  numbers  of  keyed  choices  in  each  position. 

(6)  Naturally,  to  save  himself  work  and  time  (the  Law  of  Least 
Effort)  he  wants  to  revise  as  few  items  as  possible.  Therefore,  he 
stops  revising  items  when  he  gets  within  the  requirement  of  about 
equal  numbers.  Because  he  started  with  more  items  keyed  C,  he  also 
ends  up  with  more  items  keyed  C (but  not  as  many),  because  he  only 
needs  about  equal  numbers. 

If  the  above  scenario  is  as  universal  as  I believe,  it  means 
that,  in  the  set  of  all  multiple-choice  items  in  the  world,  more  are 
keyed  C than  any  other  choice.  It  is  true  of  almost  all  of  the  tests  I 
have  checked. 

There is a widespread rule of thumb among examinees: "If you
don't know at all, guess C." I have heard this rule of thumb from
coast to coast, from high school and college students, and from
civilian employees and military personnel taking promotional tests.

I do  not  know  the  source  of  this  rule  of  thumb,  but  it  is  possible 
that  the  rule  of  thumb  gradually  grew  from  examinees'  observations 
of  the  frequency  of  keyed  choice  positions,  as  I have  suggested 
above. 

Whatever  the  origin  of  the  rule  of  thumb,  it  represents  rational 
behavior,  given  a higher  frequency  of  choices,  keyed  C,  among  the 
population  of  all  multiple-choice  items.  By  choosing  choice  C (when 
you  don't  know  at  all),  you  will  get  more  items  correct  by  chance  in 
the  long  run  than  by  guessing  at  random. 

This analysis suggests that the c-values of items keyed C will
be higher than for items keyed A, B, and D. I was able to test this
hypothesis with 127 items from 6 forms of the verbal parts of the
SCAT-II series of tests, published by the Educational Testing
Service, Princeton, NJ. The c-values were provided by Fred Lord.
A two-by-two frequency table of A, B, D vs. C by above-average c-value
vs. below-average c-value yielded a chi-square significant beyond the
.001 level. This result strongly supports the hypothesis that low
ability examinees get items keyed C correct more often than they get
items keyed A, B, or D correct.

The  results  suggest  2 alternative  courses  of  action  for  testing 
organizations. 

(1)  Require  that  there  be  exactly  the  same  number  of  keys 
in  each  position.  This  action  would  thwart  the  test-wiseness 
of  those  who  use  the  rule  of  thumb.  However,  it  represents  an 
undesirable  rigidity. 

(2) A better course of action would be to key C for less than
1/4 of the items (for 4-choice items). This action would cause
a lower average c-value for the test. The lower average c-value
would increase the total information in the test, which as we
will see in Sec. 9.4 is highly desirable.


7.5  The  Rasch  model  assumes  that  all  items  in  a test  have  the  same 
a-value,  and  that  c = .00  for  all  items.  Both  assumptions  are  nearly 
always  unrealistic. 


CHAPTER  8 

The  Test  Characteristic  Curve 


8.1 The scale of θ is continuous, but since most of the calculations
are done on digital computers, θ is usually broken into small,
discrete intervals of .05 θ units, and values of P(θ) are calculated for
each .05 interval from θ = -5.0 to θ = +5.0. The very broad range
from -5.0 to 5.0, and the small .05 intervals are used in the interest
of accuracy. Larger or smaller intervals and a broader or narrower
range may be used depending on the purpose and degree of accuracy
desired.
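
A discrete θ grid of this kind might be built as follows. This is a
minimal sketch; the function name theta_grid and its arguments are
assumptions made for this example, with defaults matching the range
and interval described above.

```python
def theta_grid(lo=-5.0, hi=5.0, step=0.05):
    """Return evenly spaced theta values from lo to hi inclusive."""
    n_steps = round((hi - lo) / step)
    # Build each point from the index rather than by repeated addition,
    # and round away floating-point noise in the last decimal places.
    return [round(lo + i * step, 10) for i in range(n_steps + 1)]


grid = theta_grid()
print(len(grid), grid[0], grid[-1])
```

With the default range and interval the grid has 201 points, since the
end points -5.0 and +5.0 are both included.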

8.2 Table 8.2 below gives the P(θ) for 17 values of θ for each of the
4 items shown in Figure 6.17.

                     P(θ)
    θ      #17    #21    #47    #50    ΣP(θ)
  -2.7     .00    .30    .38    .20     .88
  -2.3     .00    .33    .40    .23     .96
  -2.0     .00    .37    .45    .25    1.07
  -1.7     .00    .43    .52    .28    1.23
  -1.3     .00    .53    .66    .33    1.52
  -1.0     .00    .71    .87    .44    2.02
   -.7     .00    .62    .77    .48    1.77
   -.3     .00    .82    .94    .52    2.28
    0      .00    .88    .97    .59    2.44
    .3     .00    .92    .99    .65    2.56
    .7     .00    .96    .99    .74    2.69
   1.0     .01    .97    .99    .79    2.75
   1.3     .04    .98    .99    .84    2.85
   1.7     .35    .99    .99    .89    3.22
   2.0     .78    .99    .99    .91    3.67
   2.3     .96    .99    .99    .94    3.88
   2.7     .99    .99    .99    .96    3.93

Table 8.2

An  item  is  scored  dichotomously,  which  means  the  examinee  either 
gets  the  item  correct  (for  which  he  gets  an  observed  score  of  1)  or 
he  gets  the  item  wrong  (for  which  he  gets  an  observed  score  of  0). 

The  dichotomous  score  is  a result  of  the  typical  use  of  multiple- 
choice  items.  An  examinee's  dichotomous  score  (0  or  1)  is  not  a 
very  accurate  measure  of  his  knowledge. 


P(θ) may be interpreted in two ways. A P(θ) = .78 means both:

(1) 78% of the examinees with the given θ will get the
item correct, and

(2) An examinee will get correct 78% of the items for
which his P(θ) = .78.

If an examinee answers 100 questions for all of which his P(θ)
= .78, he is expected to get 78 items correct and 22 items wrong for a
% score of 78%. If there were some way to give him partial credit of
.78 points for each of the 100 items instead of 0 or 1 point he would
also get a % score of 78%. This notion of partial credit for an item,
depending on his P(θ), leads to the idea of a true score on the item.

It is often not true that the examinee is 100% or 0% certain of
his answer. Yet on a multiple-choice item he either gets full (100%)
credit for the item (1, if he gets it correct) or no (0%) credit
(0, if he gets it wrong). The examinee's degree of certainty, if
measurable, could be taken as a more precise measure of his knowledge.
P(θ) might be interpreted as this measure of his knowledge, and is
called his true score on the item. The sum of his true item scores
is his true test score. His true test score is the raw score he
would get, if there were no measurement error in the test.

The far right column in Table 8.2 is the sum of the P(θ)'s of the
4 items for each of the listed points on the θ scale. The ΣP(θ) is
the true test score of an examinee with a given θ on a test composed
of the 4 items.
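
The summation can be sketched in Python. The three rows below are
copied from Table 8.2; the names p_by_theta and true_test_score are
hypothetical, chosen for this example.

```python
# theta -> P(theta) for items #17, #21, #47, #50 (three rows of Table 8.2)
p_by_theta = {
    -2.7: [0.00, 0.30, 0.38, 0.20],
    0.0: [0.00, 0.88, 0.97, 0.59],
    2.0: [0.78, 0.99, 0.99, 0.91],
}


def true_test_score(item_probabilities):
    """True test score at one theta: the sum of the item true scores P(theta)."""
    return sum(item_probabilities)


true_scores = {theta: round(true_test_score(p), 2)
               for theta, p in p_by_theta.items()}
print(true_scores)
```

The results, .88, 2.44, and 3.67, match the far right column of the
table for θ = -2.7, 0, and 2.0.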

8.3 If we plot the true test scores against θ, we get a test
characteristic curve (TCC). Figure 8.3 shows the TCC. The TCC
gives the true score for each point on the θ scale. Notice that
the TCC is neither a straight line nor an ogive. Each test will
have its own TCC, which is the sum of the IRFs of the items in
the test.

8.4 One of the interesting uses of the TCC is to determine the
distribution of the true scores on the test. Figure 8.4 shows how
this is done. If the examinees' θs are normally distributed, as
shown on the θ axis (upside down), the examinees' true scores will be
as shown on the left. The true score distribution is found by
projecting the intervals from the θ scale onto the TCC, and then
representing the same area on the true score scale within the
projected intervals. Figure 8.4 is an excellent demonstration of how
the peculiarities of a test produce a distorted metric.

8.5 It is important to note that true scores (T) are not observed
scores (X). Observed score is defined as true score plus error
(X = T + E). However, Lord (1969) has found that the distribution
of X will be similar to the distribution of T, but sometimes with
the high points of the true score distribution flattened somewhat,
and the low points higher. The flattening is due to error.


Figure 8.4. An illustration of the use of the Test
Characteristic Curve to relate the distributions of θ
and True Score. (Plot title: WO-8 CGK items 17, 21, 47, and 50;
effect of test characteristic on distribution of true score.)


CHAPTER  9 

The Item Information Function (IIF)


9.1 We can see in Figure 6.17a that item #17 will not help us to
distinguish among examinees whose θ's are less than 1.0 because
they will all get the item wrong. Apparently, there is something
about item #17 that leads all examinees with θ < 1.0 to choose
the wrong alternative. This is an unusual situation, but
actually occurs with this question. A test made exclusively of items
like #17 would do nothing to distinguish among examinees with θ <
1.0 because they would all get zero on the test. It would give us no
distinguishing information about them.

Item #17 also gives us no distinguishing information about
examinees with θ = 2.7 or greater because they will all get it
correct. On a test composed of items like #17, all examinees with
θ > 2.7 would get 100%.


Between θ = 1.0 and θ = 2.7, it is a different story. From θ = 1.0
to θ = 1.5, P(θ) goes from P(θ=1.0) = .00 to P(θ=1.5) = .08. The change
of P(θ) means that the item does help to distinguish among examinees
within the range of θ where the change of P(θ) occurs. In this case
the difference between the P(θ)'s (to be denoted dp) = .08 (.08 - .00)
is small. The change (dp) occurs over a range (dθ) of 1/2 θ units
(1.5 - 1.0). The ratio of dp to dθ (dp/dθ) is equal to the average
slope of the IRF over the range of dθ. For the range from θ = 1.0 to
θ = 1.5, dp/dθ = .08/.5 = .16.


From θ = 1.5 to θ = 2.0 for item #17, P(θ) changes from .08 to
.78, a very large change. dp = .70 (.78 - .08) in this range, and
dp/dθ = .70/.5 = 1.40, which is very large. Item #17 is an excellent
item for distinguishing among examinees in the range θ = 1.5 to θ =
2.0. A test composed of items like #17 would give scores from about
8% to 78% for examinees whose θ's go from 1.5 to 2.0. This test
would give us a lot of distinguishing information about examinees in
this range of θ, because it would spread them out over a wide range
of test scores.

We can see that the greater the slope of the IRF, the more
information the item gives us about examinees in the range being
considered.
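
The two average slopes worked out above can be reproduced with a few
lines of Python. The function name average_slope is an assumption made
for this example; the P(θ) values are the ones quoted in the text for
item #17.

```python
def average_slope(p_low, p_high, theta_low, theta_high):
    """Average slope dp/d(theta) of the IRF between two points on theta."""
    return (p_high - p_low) / (theta_high - theta_low)


# Item #17, theta from 1.0 to 1.5: P(theta) goes from .00 to .08.
slope_1 = average_slope(0.00, 0.08, 1.0, 1.5)

# Item #17, theta from 1.5 to 2.0: P(theta) goes from .08 to .78.
slope_2 = average_slope(0.08, 0.78, 1.5, 2.0)

print(round(slope_1, 2), round(slope_2, 2))
```

The second interval has nearly nine times the average slope of the
first, which is why item #17 separates examinees so much better
between θ = 1.5 and θ = 2.0.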

9.2 If we could make the range of θ over which we find the slope
smaller and smaller, we would eventually get to the slope of the IRF
at a point, which would be the slope of the tangent line to the IRF at
a particular point of θ.

The slope of the IRF would be a measure of the relative amount
of information the item gives about examinees at that point. The
greater the slope, the more information.


Fortunately, there is an easy way to find the slope of the
logistic ogive. The slope of the IRF is given by:

$$\frac{dp}{d\theta} = \frac{1.7a\,(1-c)\,e^{1.7a(\theta-b)}}{\left[1 + e^{1.7a(\theta-b)}\right]^{2}}$$

where a, b, and c are the item parameters and θ is the point at
which dp/dθ is the slope. The slope is also sometimes denoted as
P'(θ), or P' for short. In calculus P'(θ) is known as the first
derivative of P(θ). Since the slope (P') is a measure of information,
it is possible to plot a curve that shows the amount of information
an item gives at each point on the θ scale.
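As a rough check on the slope formula, the sketch below compares it with a numerical derivative. It assumes the three-parameter logistic form P(θ) = c + (1 - c)e^x/(1 + e^x), with x = 1.7a(θ - b), which is the form whose derivative gives the slope above. The function names and the item parameters are hypothetical, chosen only for illustration.

```python
import math

def irf(theta, a, b, c):
    """Three-parameter logistic IRF (assumed form), with the 1.7 scaling constant."""
    z = math.exp(1.7 * a * (theta - b))
    return c + (1.0 - c) * z / (1.0 + z)

def irf_slope(theta, a, b, c):
    """Slope P'(theta) of the IRF, following the formula in the text."""
    z = math.exp(1.7 * a * (theta - b))
    return 1.7 * a * (1.0 - c) * z / (1.0 + z) ** 2

# Hypothetical item parameters, not taken from the report.
a, b, c = 1.2, 0.5, 0.2
h = 1e-6
analytic = irf_slope(b, a, b, c)
numeric = (irf(b + h, a, b, c) - irf(b - h, a, b, c)) / (2.0 * h)  # central difference
print(analytic, numeric)
```

At θ = b the exponential equals 1, so the slope reduces to 1.7a(1 - c)/4, which is a convenient value to verify by hand.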


9.3 However, there is a catch. For mathematical and statistical
reasons which we will not go into, P'(θ) is not a completely
appropriate measure of information, but a related function is.

The function is:

$$I(\theta,u) = \frac{P'^{2}}{P(\theta)\,Q(\theta)} = \frac{(1.7a)^{2}\,(1-c)}{\left[c + e^{1.7a(\theta-b)}\right]\left[1 + e^{-1.7a(\theta-b)}\right]^{2}}$$

where P'^2 is P' squared, and Q(θ) = 1 - P(θ). Note that the
exponent of the left e in the denominator is positive, and the
exponent of the right e is negative.


Figure  9.4a.  The  Item  Information  Functions  of  four 
real  items. 


That function is called the Item Information Function (IIF), and
is written I(θ,u). The above formula for I(θ,u) may look even more
ominous than the formula for P(θ), but in fact it is only slightly
more complicated. It is still feasible to calculate points of
I(θ,u) with a typical scientific hand calculator.
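The calculation is equally short in code. The sketch below evaluates I(θ,u) two ways: from the closed form, and as P' squared divided by P(θ)Q(θ). It assumes the three-parameter logistic P(θ) with the 1.7 constant; the function names and the parameter values are illustrative assumptions.

```python
import math

def item_information(theta, a, b, c):
    """Closed-form Item Information Function for a three-parameter logistic item."""
    x = 1.7 * a * (theta - b)
    return (1.7 * a) ** 2 * (1.0 - c) / ((c + math.exp(x)) * (1.0 + math.exp(-x)) ** 2)

def information_from_slope(theta, a, b, c):
    """The same quantity computed as P'^2 / (P * Q)."""
    z = math.exp(1.7 * a * (theta - b))
    p = c + (1.0 - c) * z / (1.0 + z)
    slope = 1.7 * a * (1.0 - c) * z / (1.0 + z) ** 2
    return slope ** 2 / (p * (1.0 - p))

# Hypothetical item: a = 1.2, b = 0.5, c = 0.2.
for theta in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(theta, round(item_information(theta, 1.2, 0.5, 0.2), 4),
          round(information_from_slope(theta, 1.2, 0.5, 0.2), 4))
```

The two columns agree, which is a useful guard against sign errors in the exponents.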

9.4 Figure 9.4a shows the I(θ,u) for the four items whose IRFs are
shown in Figure 6.17. (Note that the vertical scale for item #17 is
different from the others.) In comparing the IRFs with the IIFs,
you will note three important relationships.

(1) The IIF is highest close to where the slope of the IRF is
steepest.

(2) The total area under the IIF increases as the a-value
increases.

(3) The total area under the IIF decreases as the c-value
increases.


The fact that total information (i.e. total area under the IIF)
increases as the a-value increases demonstrates the importance of
high a-values for items. However, there is another effect of high
a-values. As the a-value increases, the width of the θ scale over
which the information is distributed decreases. The effect is called
the bandwidth paradox*. Thus, sometimes a compromise must be made
between the total information provided by the item and the
distribution of information over θ.

*This bandwidth paradox is different from the bandwidth paradox
described by Cronbach (1960, p. 602).
Figure 9.4b. The relationship of the c-value to
total information provided by an item (given a).


The total information (A_g) of item g is given by

$$A_g = 1.7a\,\frac{c\log c + (1-c)}{1-c} = 1.7a + \frac{1.7a\,c\log c}{1-c} = 1.7a\left(1 + \frac{c\log c}{1-c}\right)$$

where a and c are the item parameters and log c is the natural
logarithm of c. From inspection of the formula for A_g, you can see that
as the a-value increases, so does A_g. Also apparent is the fact that,
as c approaches zero, A_g approaches 1.7a. Therefore, the maximum
total information an item can provide is 1.7a. Not so obvious from
the formula for A_g is the relation that, as c approaches 1.00, A_g
approaches zero. This occurs because log c is negative except when
c = 1, and because, as c approaches 1, c log c/(1-c) approaches -1.
This relation explains the effect of the c-value: the c-value destroys
information. Figure 9.4b shows how total information decreases as c
increases while holding the a-value constant.

Since the b-value is not included in the formula for A_g, the b-value
does not affect the total information.
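The behavior of the total information can be checked numerically. The sketch below assumes the closed form A_g = 1.7a(1 + c log c/(1 - c)) and compares it with a trapezoid-rule area under the IIF. The function names, the integration limits, and the parameter values are assumptions made for illustration.

```python
import math

def item_information(theta, a, b, c):
    """Item Information Function for a three-parameter logistic item."""
    x = 1.7 * a * (theta - b)
    return (1.7 * a) ** 2 * (1.0 - c) / ((c + math.exp(x)) * (1.0 + math.exp(-x)) ** 2)

def total_information(a, c):
    """Closed-form area under the IIF; the limit as c goes to 0 is 1.7a."""
    if c == 0.0:
        return 1.7 * a
    return 1.7 * a * (1.0 + c * math.log(c) / (1.0 - c))

def numeric_area(a, b, c, lo=-20.0, hi=20.0, steps=20000):
    """Trapezoid-rule approximation of the area under the IIF."""
    h = (hi - lo) / steps
    total = 0.5 * (item_information(lo, a, b, c) + item_information(hi, a, b, c))
    for k in range(1, steps):
        total += item_information(lo + k * h, a, b, c)
    return total * h

for c in (0.0, 0.2, 0.5):
    print(c, round(total_information(1.0, c), 4), round(numeric_area(1.0, 0.0, c), 4))
```

Changing the b argument of `numeric_area` shifts the curve along θ without changing the area, consistent with the remark that the b-value does not affect total information.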

9.5 The point on θ where the IIF is highest is not at the b-value,
as one might expect (except when c = 0). The point on θ where
information is greatest is given by

$$\theta_{max} = b + \frac{1}{1.7a}\,\log\left[\frac{1 + \sqrt{1 + 8c}}{2}\right]$$

where "log" means the natural logarithm.
The point on θ where information is maximized is always to the
right of the b-value (except when c = 0, when it is at the b-value), but
never farther to the right than .41/a.
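The .41/a bound can be verified directly. The sketch below assumes the standard closed form for the maximizing point of a three-parameter logistic IIF, θ_max = b + log[(1 + sqrt(1 + 8c))/2]/(1.7a); the function name and the parameter values are hypothetical.

```python
import math

def theta_max_information(a, b, c):
    """Point on the theta scale where the IIF peaks (assumed closed form)."""
    return b + math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (1.7 * a)

# Offset of the peak from the b-value for a = 1, b = 0 and several c-values.
for c in (0.0, 0.1, 0.2, 0.25, 1.0):
    print(c, round(theta_max_information(1.0, 0.0, c), 3))
```

With c = 0 the logarithm is log(1) = 0, so the peak sits at b. With c = 1 the offset is log(2)/(1.7a), about .41/a, which is the bound stated above.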

9.6 The IIF is symmetrical when c = 0 and skewed to the right when
c > 0. The larger c is, the greater the right-skew. The right-skew
occurs because the c-value destroys more information at low levels
of θ than at high levels. This result makes sense because examinees
at low θs will guess more than examinees at high θs. Guessing (i.e.
the opportunity to get the item correct by guessing) destroys
information. It is for this reason that five-choice items are preferred
to four-choice items.


CHAPTER  10 


The  Test  Information  Curve  and  Relative  Efficiency  Curve. 

10.1 The Test Information Curve (TIC) is nothing more than the sum of
the IIFs. IIFs are summed by "stacking them on top of each other."
"Stacking" IIFs merely means that the heights (i.e. the amount of
information) of the IIFs at a particular value of θ are added together
to get the height of the TIC at that value of θ. Plotting the sum of
item information at each value of θ gives the TIC. The height of the
TIC at θ is written as I(θ).
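"Stacking" is just a sum over items at each θ. The following minimal sketch assumes three-parameter logistic items; the name `tic_height` and the (a, b, c) values are hypothetical and are not the parameters of the report's items.

```python
import math

def item_information(theta, a, b, c):
    """Item Information Function for a three-parameter logistic item."""
    x = 1.7 * a * (theta - b)
    return (1.7 * a) ** 2 * (1.0 - c) / ((c + math.exp(x)) * (1.0 + math.exp(-x)) ** 2)

def tic_height(theta, items):
    """Height of the TIC at theta: the item information values added together."""
    return sum(item_information(theta, a, b, c) for (a, b, c) in items)

# Hypothetical (a, b, c) triples for a four-item test.
items = [(1.5, 1.0, 0.2), (0.8, -0.5, 0.15), (1.0, 0.0, 0.25), (0.6, 0.5, 0.2)]
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(tic_height(theta, items), 3))
```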


10.2 Figure 10.2a shows the sum of the IIFs for items #17 and #21 as
shown in Figure 9.4a. Figure 10.2b shows the IIF of item #47 added to
Figure 10.2a. Figure 10.2c shows the IIF of item #50 added to the
other 3 items. A test composed of these four items would have the
weird TIC in Figure 10.2c.

10.3 The TIC shows the relative amounts of information provided by
the test at each point on θ. Where you want information depends on
what you will use the test for. If you want to select a few examinees
from a large number, then you want a lot of information at high levels
of θ, so that you can tell just which examinees are the best. For
example, see Figure 10.3a. If you want to select all examinees except
a few, then you want a lot of information at low θs so you can tell
which examinees are the worst (e.g. see Figure 10.3b).
Sometimes a test is designed for more than one purpose, such as
to be used with two cut scores for entrance into two different
schools. In this case a two-humped TIC will give good information at
the two cut scores (e.g. see Figure 10.3c).

A TIC  of  any  desired  shape  may  be  constructed,  provided  the 
items  with  the  necessary  IIFs  are  available  to  construct  the  TIC. 

10.4 Usually we already have a test and want to revise it to make it
better serve our purpose. A comparison of the new and old versions
should be made using the Relative Efficiency Curve (REC). The REC is
nothing more than the ratio of the TICs. The ratio of the two curves
is found by dividing the I(θ) of one test by the I(θ) of the other
test at each point on θ. Figure 10.4 is the REC, comparing the TIC
in Figure 10.3c to the TIC in Figure 10.3b.

Where the REC is above 1.0, the test in Figure 10.3c (the test
for which the I(θ) is the numerator of the REC ratio) is better than
the test for Figure 10.3b. Where the REC is below 1.0, the test for
Figure 10.3b is better. And where the REC = 1.0, the two tests are
the same.
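The REC is a pointwise ratio, so it takes only one more function once the TIC is available. The sketch below assumes three-parameter logistic items; the two small tests and all function names are hypothetical, and the "new" test differs from the "old" one by a single substituted item.

```python
import math

def item_information(theta, a, b, c):
    """Item Information Function for a three-parameter logistic item."""
    x = 1.7 * a * (theta - b)
    return (1.7 * a) ** 2 * (1.0 - c) / ((c + math.exp(x)) * (1.0 + math.exp(-x)) ** 2)

def tic_height(theta, items):
    """Height of the TIC at theta."""
    return sum(item_information(theta, a, b, c) for (a, b, c) in items)

def relative_efficiency(theta, numerator_items, denominator_items):
    """REC at theta: I(theta) of one test divided by I(theta) of the other."""
    return tic_height(theta, numerator_items) / tic_height(theta, denominator_items)

# Hypothetical tests: the third item of the old test is replaced in the new one.
old_test = [(0.8, -1.0, 0.2), (0.9, -0.5, 0.2), (1.0, 0.0, 0.2)]
new_test = [(0.8, -1.0, 0.2), (0.9, -0.5, 0.2), (1.4, 1.0, 0.2)]
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(relative_efficiency(theta, new_test, old_test), 2))
```

Values above 1.0 mark the θ range where the substitution helped, and values below 1.0 mark where it cost information.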

By  starting  with  an  old  test,  making  substitutions  of  items,  and 
calculating  the  REC,  you  can  experiment  with  and  improve  the  old  test 
by  trial  and  error.  It  does  not  take  long  to  develop  some  skill  in 
replacing  items  to  improve  the  TIC  as  desired. 

10.5 Every test has some error in it. The Standard Error of Estimate
(S.E.E.) is the expected standard deviation of errors of estimated
ability. That is, if we were to give a test to a group of examinees
with identical θs, and estimate their θs with the test, the standard
deviation of those estimates would be the S.E.E.
10.6 If the estimate of θ is a maximum likelihood estimate (see
Chapter 12), the S.E.E. at a particular θ is easy to calculate from
the TIC. The S.E.E. is equal to the square root of the reciprocal
of the height of the TIC (I(θ)).

$$\text{S.E.E.} = \frac{1}{\sqrt{I(\theta)}}$$

Since I(θ) varies along the θ scale, so will the S.E.E. The
larger I(θ) is, the smaller the S.E.E. A small S.E.E. at a cut point
is highly desirable.
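The relation between the height of the TIC and the S.E.E. is a one-line computation. In this small sketch the function name and the sample I(θ) values are illustrative assumptions.

```python
import math

def standard_error_of_estimate(information):
    """S.E.E. at a point on theta, given the TIC height I(theta) at that point."""
    return 1.0 / math.sqrt(information)

# Hypothetical TIC heights: quadrupling the information halves the S.E.E.
for info in (1.0, 4.0, 16.0):
    print(info, standard_error_of_estimate(info))
```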


10.7 The average S.E.E. (S.E.E.) over examinees is related to the
reliability of Classical Test Theory (r_xx), when the scores are
standardized to a standard deviation = 1.0.


1 


This relation implies that a test with high reliability may be a
poor test for your purposes because it has low information at the
critical values of θ. Similarly, a test with low reliability may be an
excellent test for some purposes, if it has high information where it
is needed. Thus, reliability is highly misleading as to the value of a
test.

The relation also makes clear the dependence of reliability on the
distribution of ability. If many examinees are on the θ scale where
there is high information, then the reliability will be higher than if
they are distributed on θ at points where information is low.
CHAPTER 11

The  Score  Information  Curve 


11.1 The test information curve (I(θ)) gives the maximum amount of
information about θ that can be extracted from the test. However, to
get the maximum information, items must be optimally weighted. The
optimal weight (W(θ)) of an item is given by

$$W_i(\theta) = \frac{P'_i}{P_i\,Q_i} = \frac{1.7a\,e^{1.7a(\theta-b)}}{c + e^{1.7a(\theta-b)}}$$

There is a curious characteristic of W(θ). It varies with θ.
That means that item A should receive different weights for examinees
with different θs. But to get W(θ), you must know θ, which is what
you are trying to get by giving the test.
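The dependence of the optimal weight on θ is easy to see numerically. The sketch below assumes the closed form W(θ) = 1.7a e^x/(c + e^x), with x = 1.7a(θ - b); the function name and the item parameters are hypothetical.

```python
import math

def optimal_weight(theta, a, b, c):
    """Optimal item weight W(theta) = P' / (P * Q) for a three-parameter logistic item."""
    z = math.exp(1.7 * a * (theta - b))
    return 1.7 * a * z / (c + z)

# Hypothetical item with guessing (c = 0.2): the weight rises with theta.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(optimal_weight(theta, 1.0, 0.0, 0.2), 3))
```

When c = 0 the weight is the constant 1.7a for every examinee, so under this formula the dilemma arises only for items with a nonzero c-value.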

11.2 There are two ways to approach this dilemma.

(1) The most satisfactory way is to use an iterative computer
program, such as LOGIST or OGIVIA (see Chap. 15). These computer
programs, in effect, make use of the optimal item weights and
hence yield maximum information about θ.

(2) A rough approximation would be to take raw scores on the
test, divide the distribution of raw scores into, say, top,
middle and bottom groups and then rescore using different
item weights for each group. This procedure would not yield
maximum information, but would provide more information than
not using variable item weights at all.
11.3 If neither of the options in Section 11.2 is possible, then you
may have to resort to the use of number-right score. In this case
the amount of information provided by this scoring procedure becomes
of interest. The amount of information provided by a number-right
score is called the number-right Score Information Curve (SIC). The
formula for the SIC (also written as I(θ,x)) is

$$I(\theta,x) = \frac{\left(\sum P'_i\right)^{2}}{\sum P_i\,Q_i}$$

11.4 The SIC usually has the same general shape as the TIC, but is
lower than the TIC at all values of θ. At high θ the TIC and SIC will
be nearly the same height (i.e. SIC/TIC ≈ 1.0). As θ becomes smaller
and smaller, SIC/TIC becomes smaller. This result means that, at high
θs little information is lost by using a number-right score, but at low
θs relatively much information is lost. Such is the penalty for use of
the inefficient number-right score.

11.5 The SICs of two tests may be used just as the TICs are used. A
rough approximation of the standard error of estimate may be found for
each θ using the number-right scoring procedure, and the ratio of the SICs
of two number-right scored tests may be interpreted in the same manner as
the Relative Efficiency Curve for TICs. (Strictly speaking, for this
interpretation to be legitimate, the test score must be shown to be an
unbiased estimate of θ.)

11.6 The SIC is plotted by a computer program available from the
Educational Testing Service (see Chapter 15), and may be derived from a
program by John Gugel (see Section 15.4).




CHAPTER  12 


Maximum Likelihood Estimation of θ

12.1 There are two main ways in IRT to estimate an examinee's θ.
They are called the Maximum Likelihood Estimation method and the
Bayesian Modal Estimation method. Both methods use the actual
response pattern of the examinee rather than the raw score. The
difference between the two methods is merely an additional assumption
made by the Bayesian method.

12.2 A response is indicated by the lower case letter u. If the examinee
gets item i correct, then u_i = 1, and if he gets it wrong, then u_i = 0. A
response pattern is also called a response vector, and is represented by
the uppercase letter U. A response pattern is a list of zeroes and ones,
indicating which questions the examinee got correct or wrong in the order
the items appear in the test. For example, in a four-item test, an
examinee who got the first two items correct and the last two wrong would
have a response pattern U = 1100. If he got the first and third items
correct and the other two items wrong, his response pattern would be
U = 1010. If he got the first three wrong and the last item correct, he
would have a response pattern U = 0001.

12.3 We recall that P_i(θ) is the probability that an examinee with
ability θ will get item i correct. Q_i(θ) is the probability that an
examinee with ability θ will get item i wrong. Q_i(θ) = 1 - P_i(θ). We
will abbreviate P_i(θ) and Q_i(θ) by P_i and Q_i.


12.4 Probability theory tells us that the probability of independent
events occurring together is equal to the product of their separate
probabilities. We know that the probability of getting one item
correct or wrong is independent of the probability of getting other
items correct or wrong for any given value of θ. We know this because
of the assumption of local independence.*


12.5 Therefore, the probability of an examinee getting item 1 correct
and item 2 wrong is P1Q2. The probability of getting both items wrong
is Q1Q2. Getting item 1 correct and item 2 wrong is the response
pattern U = 10. Therefore, P(U=10) = P1Q2, P(U=00) = Q1Q2,
P(U=01) = Q1P2, and P(U=11) = P1P2.

Similarly, for three items for a given θ, if:

P1 = .3    Q1 = .7
P2 = .6    Q2 = .4
P3 = .8    Q3 = .2


*The  assumption  of  local  independence  will  be  discussed  in  Sec.  14.3. 



then

 U      L(U|θ) = Likelihood

000     Q1 Q2 Q3 = .7 x .4 x .2 = .056
001     Q1 Q2 P3 = .7 x .4 x .8 = .224
010     Q1 P2 Q3 = .7 x .6 x .2 = .084
100     P1 Q2 Q3 = .3 x .4 x .2 = .024
011     Q1 P2 P3 = .7 x .6 x .8 = .336
101     P1 Q2 P3 = .3 x .4 x .8 = .096
110     P1 P2 Q3 = .3 x .6 x .2 = .036
111     P1 P2 P3 = .3 x .6 x .8 = .144

Table 12.5

The likelihood of each possible response pattern for a
given θ, where the P_i(θ) are as given in Section 12.5.


12.6 These probabilities are called likelihoods (and written L(U|θ)).
Each likelihood is the conditional probability of a response
pattern (U) given θ, i.e. L(U|θ). The general formula for a
likelihood is

$$L(U|\theta) = \prod_{i=1}^{n} P_i^{u_i}\,Q_i^{1-u_i}$$
The upper case Greek letter Π, written with the limits i = 1 and n,
means the product of all the P_i^{u_i} Q_i^{1-u_i} where i goes from 1
to n (n = the # of items in the test), just as Σ with the same limits in
statistical notation means the sum of a series of numbers where i goes
from 1 to n.

When u_i = 1,

$$P_i^{u_i}\,Q_i^{1-u_i} = P_i^{1}\,Q_i^{1-1} = P_i^{1}\,Q_i^{0} = P_i$$

When u_i = 0,

$$P_i^{u_i}\,Q_i^{1-u_i} = P_i^{0}\,Q_i^{1-0} = P_i^{0}\,Q_i^{1} = 1 \cdot Q_i = Q_i$$

When u_i = 1, the Q_i drops out, and when u_i = 0, the P_i drops out.
Thus, P_i^{u_i} Q_i^{1-u_i} is just a convenient mathematical way of
getting rid of the P or Q depending on the value of u_i. For a
three-item test the likelihood of U = 011 is

$$L(U=011|\theta) = \prod_{i=1}^{3} P_i^{u_i}\,Q_i^{1-u_i} = P_1^{u_1}Q_1^{1-u_1} \cdot P_2^{u_2}Q_2^{1-u_2} \cdot P_3^{u_3}Q_3^{1-u_3} = P_1^{0}Q_1^{1} \cdot P_2^{1}Q_2^{0} \cdot P_3^{1}Q_3^{0} = Q_1 P_2 P_3$$
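The product formula translates directly into code. The sketch below applies it to every response pattern of a three-item test, using the probabilities from Table 12.5 (P1 = .3, P2 = .6, P3 = .8). The function name `likelihood` and the tuple representation of U are assumptions made for illustration.

```python
from itertools import product

def likelihood(pattern, p):
    """L(U|theta): product over items of P_i**u_i * Q_i**(1 - u_i)."""
    result = 1.0
    for u, p_i in zip(pattern, p):
        result *= p_i ** u * (1.0 - p_i) ** (1 - u)
    return result

p = (0.3, 0.6, 0.8)  # P1, P2, P3 at the given theta, from Table 12.5
table = {u: likelihood(u, p) for u in product((0, 1), repeat=3)}
for u, value in table.items():
    print("".join(str(bit) for bit in u), round(value, 3))
print("sum:", round(sum(table.values()), 3))
```

Because the eight patterns exhaust the possibilities for three items, the likelihoods at a fixed θ sum to 1.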
          Item #1      Item #2      Item #3
   θ     P1    Q1     P2    Q2     P3    Q3    L(U=010|θ) = Q1 x P2 x Q3    L(θ|U)

 -3.0   .29   .71    .36   .64    .21   .79    .71 x .36 x .79 = .202        .169
 -2.5   .32   .68    .39   .61    .22   .78    .68 x .39 x .78 = .207        .173
 -2.0   .37   .63    .45   .55    .25   .75    .63 x .45 x .75 = .213        .178
 -1.5   .50   .50    .60   .40    .30   .70    .50 x .60 x .70 = .210        .176
 -1.0   .62   .38    .77   .23    .38   .62    .38 x .77 x .62 = .181        .151
 -0.5   .77   .23    .90   .10    .50   .50    .23 x .90 x .50 = .104        .087
  0.0   .88   .12    .97   .03    .59   .41    .12 x .97 x .41 = .048        .040
  0.5   .93   .07    .99   .01    .70   .30    .07 x .99 x .30 = .021        .018
  1.0   .97   .03    .99   .01    .79   .21    .03 x .99 x .21 = .006        .001
  1.5   .98   .02    .99   .01    .87   .13    .02 x .99 x .13 = .003        .000
  2.0   .99   .01    .99   .01    .91   .09    .01 x .99 x .09 = .000        .000
  2.5   .99   .01    .99   .01    .95   .05    .01 x .99 x .05 = .000        .000

                                               ΣL(U|θ) = 1.195              1.000

Table 12.7

The method of calculating the Maximum Likelihood
Estimate of θ from a test of 3 items for an examinee
with the response pattern, U = 010.
12.7 When we give a test, we get each examinee's response pattern,
and we want his θ. L(U|θ) is not what we want, since we already have
U. What would help us estimate an examinee's θ is just the reverse,
i.e. L(θ|U).

Fortunately, Bayes' Theorem allows us to get L(θ|U) from L(U|θ).

$$L(\theta|U) = \frac{L(U|\theta)}{\sum L(U|\theta)}$$

To use Bayes' Theorem we have to get the L(U|θ) at several points on
the θ scale. How many points we use is determined by how accurately
we want to estimate θ.

To show how this is done, L(U=010|θ) is calculated in Table 12.7
for three hypothetical items at 12 values of θ.

The total of the L(U|θ)s is ΣL(U|θ). The right column shows
L(θ|U) = L(U|θ)/ΣL(U|θ). Any examinee, no matter what his θ, could
conceivably have a U = 010 in this three-item test. There is a finite
probability of U = 010 at every θ.

However, the likelihood of an examinee having U = 010 varies
considerably with θ. An examinee with θ ≥ 0.0 is unlikely to have
U = 010. In fact, only 6% of examinees with θ ≥ 0.0 will have U = 010.


Note:  The  proponents  of  Maximum  Likelihood  Estimation  do  not  agree 
with  the  use  of  Bayes'  Theorem  in  this  explanation. 


A graph of the likelihoods (for U = 010) would look like Figure
12.7.

Figure 12.7. The graph of the likelihoods in Table
12.7, called the likelihood function.

This curve is called the likelihood function.

If you had to guess the θ of an examinee with U = 010, what θ
would you guess from the information in Table 12.7? You should guess
his θ = -2.0 because the likelihood of U = 010 is greater at θ = -2.0
than at any other θ. Therefore, you would be right more often than if
you guessed any other θ. By choosing the θ with the greatest
likelihood, you have chosen the θ with the maximum likelihood. And that is
the Maximum Likelihood method of estimating θ! That's all there is to
it.
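The procedure behind Table 12.7 can be sketched in a few lines. The P values below are read from the table. The signs of the first six θ values are taken as negative, consistent with the estimate of -2.0 discussed above, and P3 = .87 at θ = 1.5 is inferred from Q3 = .13. The variable names are illustrative.

```python
# Grid of theta values and P_i(theta) for the three hypothetical items of Table 12.7.
thetas = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
p1 = [.29, .32, .37, .50, .62, .77, .88, .93, .97, .98, .99, .99]
p2 = [.36, .39, .45, .60, .77, .90, .97, .99, .99, .99, .99, .99]
p3 = [.21, .22, .25, .30, .38, .50, .59, .70, .79, .87, .91, .95]  # .87 inferred from Q3 = .13

# U = 010: item 1 wrong, item 2 correct, item 3 wrong, so L(U|theta) = Q1 * P2 * Q3.
likelihoods = [(1 - a) * b * (1 - c) for a, b, c in zip(p1, p2, p3)]
total = sum(likelihoods)
normalized = [value / total for value in likelihoods]  # L(theta|U) column

best = max(range(len(thetas)), key=lambda k: likelihoods[k])
mle = thetas[best]
print(mle, round(likelihoods[best], 3), round(normalized[best], 3))
```

The grid search is the whole method: evaluate the likelihood at each candidate θ and keep the θ where it is largest.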
Now look at the L(U|θ) column. At which value of θ is L(U|θ)
greatest? It is at θ = -2.0, the same as the θ with the maximum
L(θ|U). That will always be the case because the L(θ|U)'s are just the
L(U|θ)'s divided by the constant ΣL(U|θ). So the θ with the maximum
L(θ|U) will always be the same as the θ with the maximum L(U|θ).
Therefore, it is not necessary to divide by ΣL(U|θ) in order to find
the θ with the maximum likelihood.

Since we divided by ΣL(U|θ) in order to apply Bayes' Theorem,
we find that Bayes' Theorem is not necessary for maximum likelihood
estimation.

Another short cut is to take the logarithm of the P_i's and Q_i's
and add them, instead of multiplying the P_i's and Q_i's. The sum of the
logarithms will also always be maximum at the same value of θ. A graph
of the log likelihoods is called the log likelihood function. The log
likelihood function will always be highest at the same θ at which the
likelihood function is highest.

It should be noted that, in this example, you would be right
in estimating θ = -2.0 only 17.8% of the time and wrong 82.2% of the
time. But this is true only because the test had only three items.
With a longer test there would be one θ at which the likelihood is
much greater than any other.
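The logarithm short cut can be illustrated with the Table 12.7 values for U = 010. As before, the negative signs on the low θ values and P3 = .87 at θ = 1.5 are inferred from context, and the helper name `argmax` is an assumption.

```python
import math

thetas = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
p1 = [.29, .32, .37, .50, .62, .77, .88, .93, .97, .98, .99, .99]
p2 = [.36, .39, .45, .60, .77, .90, .97, .99, .99, .99, .99, .99]
p3 = [.21, .22, .25, .30, .38, .50, .59, .70, .79, .87, .91, .95]

# Likelihood of U = 010 as a product, and the log likelihood as a sum of logs.
likelihoods = [(1 - a) * b * (1 - c) for a, b, c in zip(p1, p2, p3)]
log_likelihoods = [math.log(1 - a) + math.log(b) + math.log(1 - c)
                   for a, b, c in zip(p1, p2, p3)]

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda k: values[k])

print(thetas[argmax(likelihoods)], thetas[argmax(log_likelihoods)])
```

Both searches land on the same θ because the logarithm is an increasing function. For long tests the sum of logs also avoids the very small products that multiplication would produce.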

12.8 Table 12.8 shows the maximum likelihood method of estimating
θ for a test made of the four items whose IRFs are shown in Figure
6.17.

Table 12.8. An illustration of the MLE of θ for all possible
response patterns from a test composed of four real
items. (All likelihoods are multiplied by 1000 to
reduce decimal values.)

(1) across the top are 17 values of θ.

(2) under the θ's are the P(θ)'s for each of the four items.

(3) the item numbers and parameters are in the top left corner.

(4) down the left side are the 16 possible response patterns for
four items and the raw (# right) score represented by the response
patterns.

(5) in the body of the table are the L(U|θ)'s for each
possible U for the 17 values of θ. Each L(U|θ) is
multiplied by 1000 to eliminate decimal values.

(6) underlined in each row is the maximum L(U|θ).

(7) down the right side are the values of θ where the
underlined maximum likelihoods occur. These θ's are the
maximum likelihood estimates (MLE) of θ for each of the 16
possible U.

Note that the MLE for U = 0000 is -∞, and the MLE for U = 1111
is +∞. That is a characteristic of the MLE. The MLE will not give a
finite estimate of θ unless the examinee has missed at least one item
and answered at least one item correctly. This limitation is not
serious because raw scores of 0% or 100% are usually rare.

The MLE of θ > 2.7 is due to the limited range of θ used in this
example. A larger range of θ would yield a more precise MLE of θ.

The many cells with L(U|θ) = 0 in the body of Table 12.8 are due
to the very unusual item #17.

12.9 Now compare in Table 12.8 the raw scores on the left with the
MLEs on the right. You can see that a raw score of 1 represents
θs from -2.3 to +2.0, an extreme range! A raw score of 2 represents
θs from -1.3 to greater than +2.7. A raw score of 3 represents θs
from +1.3 to greater than +2.7.

The extreme range of θ, depending on the U's represented by a
single raw score, demonstrates well the inadequacy of using raw
score as an estimate of ability. The inadequacy of raw score as an
estimate of ability is due to the fact that raw score cannot
distinguish chance success from knowledge success on an item. In
contrast, the MLE takes guessing into account by using the additional
information in the response pattern.
CHAPTER 13


Bayesian Modal Estimation of θ

13.1 The Bayesian Modal method of estimating θ takes up where the MLE
stops. The proponents of the Bayesian Modal method (called Bayesians)
reason that if the distribution of θ is known or assumed, then that
knowledge or assumption provides additional information which can be
used to more accurately estimate θ.

13.2 Bayesians assume that θ is distributed normally. The assumption
of normality means that the probability of any randomly-chosen examinee
having a θ at the extremes is less than his probability of having a
θ located near the mean. The assumption of normality is made on an a
priori basis (i.e. before empirical evidence). Thus, it is called the
normal "prior" distribution.

13.3 Suppose the likelihood of θ1|U is very close to the likelihood of
θ2|U, but that there are many more examinees at θ2 than at θ1. In
this case we would be right more often by estimating θ at θ2 than at
θ1. In doing so we would, in effect, be weighting our likelihood by
the number of examinees at the two θ values. If we take this idea to
its logical extreme, we should weight all likelihoods by the proportion
of examinees at each value of θ in order to reduce our errors.

13.4 By assuming a normal distribution of θ we can weight the
likelihood by the relative proportions of area under the normal curve.
To do this we merely multiply the area under the normal curve, N(0,1),
within the interval at θ by L(U|θ). Table 13.4* shows how this
is done using the likelihoods from Table 12.8.

*There are several computational errors in Table 13.4. However,
these errors do not affect the explanation of the concepts involved.
An illustration of the Bayesian Modal Estimate of θ
for all possible response patterns from a test composed
of four real items. (All likelihoods are multiplied
by 10,000 to reduce decimal values.)


(1) the top row are points of θ which are midpoints of
intervals of θ.

(2) the 2nd and 3rd rows are the limits of the intervals.

(3) the 4th row is the proportion of area under the normal
curve and within the interval.

(4) in the body of the table each column is the area in the 4th
row multiplied by the corresponding likelihood from Table 12.8
(times 10,000 to remove decimal values), i.e. L(U|θ) x the area
under N(0,1) within the interval.

(5) the largest value in each row is underlined.

(6) the θ's for the underlined likelihoods are in the right
column. These are the Bayesian Modal Estimates (BME) of θ.

The BME is called modal because, when we choose the largest value
in each row, we are choosing the mode of the distribution of L(U|θ) x
the area under N(0,1).
13.5 Bayesian Modal Estimates are more conservative than MLEs
(conservative means closer to zero, the mean of the normal prior
distribution). Note that with U = 0000 and U = 1111, the BMEs of θ are
finite. The finiteness of θ estimates of BME when either all or
no items are answered correctly is a minor advantage of BME.
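The weighting step can be sketched with the three hypothetical items of Table 12.7 (Table 12.8 and Table 13.4 use four real items whose values are not reproduced here). The interval half-width of 0.25, the function names, the negative signs on the low θ values, and P3 = .87 at θ = 1.5 are assumptions made for this illustration.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of N(0,1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

thetas = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
p1 = [.29, .32, .37, .50, .62, .77, .88, .93, .97, .98, .99, .99]
p2 = [.36, .39, .45, .60, .77, .90, .97, .99, .99, .99, .99, .99]
p3 = [.21, .22, .25, .30, .38, .50, .59, .70, .79, .87, .91, .95]

likelihoods = [(1 - a) * b * (1 - c) for a, b, c in zip(p1, p2, p3)]  # U = 010

# Area under N(0,1) within an interval centered on each theta (half-width assumed 0.25).
half_width = 0.25
areas = [normal_cdf(t + half_width) - normal_cdf(t - half_width) for t in thetas]
weighted = [l * w for l, w in zip(likelihoods, areas)]

mle = thetas[max(range(len(thetas)), key=lambda k: likelihoods[k])]
bme = thetas[max(range(len(thetas)), key=lambda k: weighted[k])]
print("MLE:", mle, "BME:", bme)
```

For this response pattern the weighted maximum falls at θ = -1.0, closer to zero than the unweighted maximum at θ = -2.0, which is the sense in which the BME is more conservative.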
13.6 There is an active controversy between the Bayesians and the
proponents of the MLE. The Bayesians argue that MLE is the same as
a BME, if θ is assumed to be distributed rectangularly. (A
rectangular distribution of θ means that there are equal numbers of
examinees at all θ values, even at +∞ and -∞.) And so, say the
Bayesians, since a normal distribution of θ is more reasonable to assume
than a rectangular distribution, the BME is a more accurate estimate
of θ.

The proponents of MLE argue that the coincidence of the MLE
(which assumes no distribution of θ) being the same as a BME with
rectangular distribution is irrelevant. The important thing is that
MLE makes no assumption about the distribution of θ, whereas BME makes
the additional assumption, which will sometimes be false.*

13.7  I shall  not  take  sides  in  this  matter,  because  for  me  the  point 
is  moot.  The  only  computer  program  available  to  me  at  present  is 
OGIVIA-3  (See  Chap.  15),  which  uses  the  BME.  Therefore,  I shall 
continue  to  use  BME  until  I have  a program  which  uses  MLE.  At  that 
time  I shall  have  to  make  a decision. 

13.8  Another  type  of  Bayesian  estimation  is  called  Owen's  Bayesian, 
after  its  inventor,  R.  L.  Owen  (1975).  The  Owen's  Bayesian  method 
is  used  primarily  in  tailored  testing  (See  Chap.  17). 


★I  apologize  to  both  sides  of  this  complex  issue  for  this  meager 
representation  of  their  positions. 
CHAPTER 14

Assumptions

14.1 There are 4 basic assumptions of IRT. The first of these is a
minor assumption. It is an assumption of any test theory, without
which there would be no justification for testing.

Assumption  #1:  The  Know-Correct  Assumption:  if  the  examinee 
knows  the  correct  answer  to  the  item,  he  will  answer  it  correctly.* 

We  have  probably  all  violated  this  assumption  while  taking  tests  by 
marking  a different  choice  than  we  intended  to  mark.  Occasionally, 
an  examinee  will  inadvertently  skip  an  item,  and  then  mark  all  the 
rest  of  his  answers  in  the  wrong  places.  This  is  merely  a clerical 
error,  but  there  is  no  provision  for  it  in  any  test  theory.  Another 
way  to  state  the  first  assumption  is:  if  he  got  the  item  wrong, 
then  he  did  not  know  the  answer. 

14.2 Assumption #2: The Normal Ogive Assumption: The IRF takes the
form of the normal ogive. This is the problem, mentioned in Section
3.3, which deterred Lord's work for 10 years. The difficulty lay with
3 parts of the IRF.

a.  The  lower  asymptote 

b.  The  upper  asymptote 

c.  The  middle  or  rapidly  rising  part  of  the  IRF 


*The reader should take careful note that the converse of this
assumption is NOT made. That is, it is NOT ASSUMED that if the examinee
gets the item correct, he knows the answer. I emphasize this distinction
because many persons upon first reading of assumption #1 misread
it as its converse.

(1) As previously noted, the c-value of an IRF is often not
1/A. This is the case with observed parts of the lower asymptote.

But what about the unobserved parts? If an item from the SAT with
c = .09 were given to extremely low θ persons such as kindergarten
children or mentally retarded persons, would the lower tail of the
IRF rise to 1/A?

(2) It has been charged by Hoffmann (1962) that tests may
penalize extremely high ability persons, because they know too much.
That is, they consider factors far beyond the intended scope of the
item, and therefore get it wrong. If that were the case, then the IRF
would curve down away from the upper asymptote at high θ's. This has
been called the Banesh Hoffmann Effect.

(3) It was not known whether the IRF was monotonic, or whether its
general shape was that of a normal ogive.

In  1965  Lord  published  a massive  study  with  a sample  size  greater 
than  100,000.  Specifically,  he  found: 

a. for almost all items, the lower tail of the IRF did not rise.
The very few items whose lower tails did rise did so to a very
small extent.

b. no evidence of the Banesh Hoffmann Effect.

c.  good  indications  that  the  IRF  is  strictly  monotonic. 





14.3 Assumption #3: Local Independence. Local independence means
that the probability of an examinee getting an item correct is
unaffected by the answers given to other items in the test. Local
independence does NOT mean that the items correlate zero with each
other.

The most common situation where local independence does not hold
is in a speeded test. In a speeded test an examinee may get the last
items wrong, simply because he did not reach them. A distinction is
made between not-reached items and omitted items. Not-reached items
are those unanswered items which have no answered items after them
in the sequence of items in the test. Omitted items (omits) are
unanswered items which have at least one item answered after them in
the sequence of items in the test. This distinction is important when
deciding what to do with not-reached items and omits in scoring answer
sheets. Not-reached items are not attempted (and hence there is no
possibility of their being correctly answered) simply because of the
presence of the early items, which were attempted during the time limit.

Furthermore, earlier items which were attempted may have been
missed, because the examinees felt rushed and could not give their
best efforts to the items.

Similarly, in long tests, fatigue effects may impact the local
independence of items.

Certain reading comprehension tests might violate local
independence when several items are all based upon some common reading
passage. However, it is not entirely clear whether such items violate
local independence.

Chain items violate local independence. An example of chain
items follows:

(1)  Who  discovered  America? 

(2)  Where  was  he  born? 

Clearly,  if  the  first  item  were  not  in  the  test,  the  second  item  would 
be  meaningless.  Fortunately,  chain  items  are  rare. 

Local independence also means that items are uncorrelated for
individuals with the same θ. This interpretation suggests a
statistical test for local independence (Lord, in preparation, p. 26):

$$ r_{gh \mid \theta} = 0, \qquad g \neq h $$

where $r_{gh \mid \theta}$ = the tetrachoric correlation between items
g and h for examinees with exactly the same ability.

To use this statistical test, it is first necessary to
get a large number of examinees with identical θ's. Then, using their
responses, calculate the interitem tetrachoric correlation. That
correlation should not be significantly different from zero. This
procedure has at least 2 practical difficulties.

First, it should be done for all (or at least several) values of
θ. It is nearly impossible to get large sample sizes at many θ
values.

Second, it must be done for all pairs of items, which would
require calculation of n(n-1)/2 tetrachoric correlations (n = # of
items in the test) for each value of θ. A 50-item test would require
1225 correlations at each θ value. If 10 θ values were chosen, that
would mean 12,250 correlations.

A similar but simpler procedure would be to partial out of the
interitem correlations the effect of θ. This may be done by using
the item-test biserial correlation. Then

$$ r_{gh \cdot \theta} = \frac{r_{gh} - r_{g\theta}\, r_{h\theta}}{\sqrt{1 - r_{g\theta}^{2}}\,\sqrt{1 - r_{h\theta}^{2}}} $$

where $r_{gh}$ = the tetrachoric correlation among all examinees between
items g and h (g ≠ h), and $r_{g\theta}$ = the biserial correlation
between item g and θ. $r_{gh \cdot \theta}$ should not be significantly
different from zero.


Before using this test of local independence, care should be
taken that the implicit assumptions of the statistics involved are
satisfactorily met. In any case it should only be considered as a
rough estimate.

This latter procedure would require n(n-1)/2 tetrachoric
correlations plus n biserial correlations (which are usually available
anyway).
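
The partialling step can be sketched in a few lines of Python. The function name and the input values are hypothetical, chosen only for illustration; the tetrachoric and biserial correlations themselves are assumed to have been computed elsewhere.

```python
from math import sqrt


def partial_interitem_correlation(r_gh, r_g_theta, r_h_theta):
    """Correlation between items g and h with theta partialled out.

    r_gh      : tetrachoric correlation between items g and h
    r_g_theta : biserial correlation of item g with theta (item-test biserial)
    r_h_theta : biserial correlation of item h with theta
    """
    numerator = r_gh - r_g_theta * r_h_theta
    denominator = sqrt(1.0 - r_g_theta ** 2) * sqrt(1.0 - r_h_theta ** 2)
    return numerator / denominator


# Hypothetical values: the interitem correlation is fully explained by theta.
print(partial_interitem_correlation(0.30, 0.60, 0.50))
# Hypothetical values: some correlation remains after theta is removed.
print(round(partial_interitem_correlation(0.50, 0.60, 0.50), 3))
```

In the first call the product of the two biserials equals the interitem correlation, so nothing is left once θ is removed. In the second call a residual of roughly .29 remains, which would then have to be judged against zero.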

Because  of  the  practical  rarity  of  conditions  violating  local 
independence,  this  assumption  is  usually  not  tested. 

14.4 Assumption #4: Unidimensionality. The assumption of
unidimensionality is the most complex and most restrictive assumption of IRT.

In general, unidimensionality means that the items measure one and
only one area of knowledge or ability. However, unidimensionality
does NOT mean that the items correlate positively with each other. In
fact, it is conceivable for all items to correlate negatively with
each other and still be unidimensional.

As a rule of thumb, tests that look unidimensional probably are
unidimensional. Thus, typical ability tests, such as verbal, numerical,
spatial perception, mechanical comprehension and tool knowledge are
probably unidimensional.


Another  rule  of  thumb  is,  items  that  test  bits  of  knowledge 
that  were  learned  together  are  probably  unidimensional.  Thus,  a final 
examination  for  a college  course  might  be  considered  unidimensional. 

An excellent example of this rule is given by Bejar, Weiss, and
Kingsbury (1977). That study involved a test in college introductory
biology. Part of the course was covered by a test divided into 3
content areas, called "Chemistry," "The Cell," and "Energy." The
single test for all 3 content areas was found to be essentially
unidimensional.

Unidimensionality in a test covering 3 such diverse-sounding
content areas is surprising. The fact of its unidimensionality may
have resulted from the items testing bits of knowledge which were
learned together in the college course.

It  may  well  have  been,  however,  that  the  subject-matters  of  the 
3 content  areas  were  not  as  diverse  as  they  sound.  It  is  likely  that 
"Chemistry"  was  the  chemistry  necessary  to  understand  the  cell.  And 
"The  Cell"  content  was  necessary  to  understand  the  "Energy"  use  and 
transfer  within  the  cell.  This  possibility  suggests  another  rule  of 
thumb.  Items  that  test  bits  of  knowledge  which  are  logically  and 
sequentially  related  may  be  expected  to  be  unidimensional. 

Rules of thumb are, by definition, sometimes erroneous. I do not
suggest that they replace efforts to empirically verify
unidimensionality. However, in view of the difficulty of empirical
verification, some readers may find them helpful.



14.5 There is no completely satisfactory test for unidimensionality
among multiple-choice items. The reason for this situation is that
most tests for unidimensionality involve factor analysis of interitem
tetrachoric correlations. Unfortunately, the tetrachoric correlation
assumes θ is normally distributed, and is not entirely appropriate when
c ≠ 0; i.e., when the item can be correctly answered by guessing.
Christoffersson (1975) has made the best attempt to develop a test for
unidimensionality (Lord, in preparation, Section 2.4, p. 27). However, the
mathematics of his method are complex and will not be discussed.

I have found 8 methods of testing for unidimensionality in the
literature. Six of the eight use factor analysis. To avoid
repetition, the initial factor analysis steps which are common to all
six will be described.

(1) convert the actual responses of examinees into zeroes and
ones; zero, if the response is wrong, and one, if the response is
correct. Factor analysis requires a sample 10 times the # of items
(N = 10n);

(2) calculate a matrix of interitem tetrachoric correlations
(not the phi coefficient), using the zero-one responses;

(3) replace each value in the diagonal with the correlation
in its row that has the largest absolute value (most factor analysis
computer programs have an option to do this automatically). If there
are too many items for the capacity of the computer, a random sample
of items may be used;

(4) do a principal component (or principal axis) factor analysis
for the first 9 factors (9 is an arbitrary, typical number).
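
Steps (1) and (3) are simple bookkeeping and can be sketched as follows. This is only an illustration under stated assumptions: the function names and the sample data are hypothetical, the correlation matrix is assumed to have been computed already (step (2) is not shown), and the diagonal is filled with the absolute value of the largest correlation on the assumption that a communality estimate should not be negative.

```python
def score_responses(responses, key):
    """Step (1): 1 for a response that matches the key, 0 otherwise."""
    return [[1 if answer == correct else 0
             for answer, correct in zip(row, key)]
            for row in responses]


def sample_is_large_enough(n_examinees, n_items):
    """Step (1) rule of thumb: N should be at least 10n."""
    return n_examinees >= 10 * n_items


def replace_diagonal(corr):
    """Step (3): put the largest absolute off-diagonal correlation
    of each row into that row's diagonal cell (returns a new matrix)."""
    result = [row[:] for row in corr]
    for i, row in enumerate(corr):
        off_diagonal = [abs(value) for j, value in enumerate(row) if j != i]
        result[i][i] = max(off_diagonal)
    return result


# Hypothetical answer sheets for a 3-item test keyed A, C, B.
scored = score_responses([["A", "C", "D"], ["B", "C", "B"]], ["A", "C", "B"])
print(scored)
print(sample_is_large_enough(250, 30))  # 30 items call for 300 examinees

# Hypothetical interitem correlation matrix.
corr = [[1.0, 0.4, -0.6],
        [0.4, 1.0, 0.2],
        [-0.6, 0.2, 1.0]]
print(replace_diagonal(corr))
```

The matrix returned by `replace_diagonal` is what would be handed to the factor analysis program in step (4).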
14.6 I have given short titles for easy reference to each of the tests
for unidimensionality:


(1) The Eigenvalue Test. Plot the eigenvalues of the nine
factors against the factor rank, as shown in Figures 14.6(1)a and
14.6(1)b. The items may be considered unidimensional if the eigenvalue
of the first factor is large compared to the second factor, and the
eigenvalues of the remaining factors are all about the same. The graph
should look something like Figure 14.6(1)a if the items are
unidimensional, and like Figure 14.6(1)b if the items are not
unidimensional (Lord and Novick, 1968, p. 283).

(2)  The  Random  Baseline  Test.  This  test  is  a variation  of  the 
Eigenvalue  Test.  It  is  necessary  to  do  the  Eigenvalue  Test  first. 

To  get  the  random  baseline,  create  with  a random  generator  a matrix 
of  zeroes  and  ones  of  the  same  order  as  the  matrix  in  step  (1)  of 
Section  14.5.  Then  perform  steps  (2),  (3),  and  (4)  just  as  with  the 
Eigenvalue  Test.  Plot  the  eigenvalues  from  the  random  data  on  the  same 
graph  as  the  Eigenvalue  Test.  Unidimensionality  is  indicated  if  only 
the  first  factor  eigenvalues  are  distinguishable  for  the  2 sets  of  data 
(McBride  and  Weiss,  1974, p. 30).  See  Figures  14.6(2)a  and  14.6(2)b. 


(3) The Biserial Test. Compute the correlation between the
item-test biserial correlation and the item first factor loading.
A high (.80 or higher) correlation supports the assumption of
unidimensionality (McBride and Weiss, 1974, pp. 31, 33 and 37).

(4) The Factor Loading Test. Unidimensionality is indicated
if the first factor loadings for all items are significant and have
the same sign (+ or -). (McBride and Weiss, 1974, p. 33).

Figures 14.6(1)a and b. A hypothetical illustration of the
Eigenvalue Test for unidimensionality.

(5) The Congruence Test. If the examinees can be separated
into two meaningful subgroups (Black/White, male/female), then a factor
analysis (steps (1) to (4) in Section 14.5) can be performed on each
group. The Coefficient of Congruence ($C_{AB}$) of the item first factor
loadings between the 2 groups will approach 1.00, giving evidence of
unidimensionality with respect to the variable on which the groups were
defined, i.e. race, sex (Pine, 1977, p. 4). (See Rummel, 1970, p. 461
for Coefficient of Congruence.)


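$$ C_{AB} = \frac{\sum_{i=1}^{n} L_{ia} L_{ib}}{\sqrt{\left(\sum_{i=1}^{n} L_{ia}^{2}\right)\left(\sum_{i=1}^{n} L_{ib}^{2}\right)}} $$

where $L_{ia}$ = loading of item i for group A on the 1st factor
$L_{ib}$ = loading of item i for group B on the 1st factor
n = number of items in the test
$C_{AB}$ = Coefficient of Congruence between groups A & B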


(6) The Communality Test. It has been suggested that
unidimensionality may be tested by


G


where $r_{ij}$ = interitem tetrachoric correlation
$h_i^{2}$ = the item communality
n = the number of correlations

According to Green, et al (1977, p. 836) this function, which I have
designated G, approaches 1.00 as dimensionality approaches unity.

I have applied G to data published in McBride and Weiss (1974).
These data give item communalities and interitem tetrachoric
correlations for six real word knowledge tests, and six sets of random data.
The six real tests were found by McBride & Weiss to be essentially
unidimensional by three different measures of unidimensionality, i.e.
the Random Baseline Test (see 14.6(2) above), the Biserial Test (see
14.6(3) above), and the Factor Loading Test (see 14.6(4) above). G for
the real tests ranged from .419 to .484, and had a Spearman rank
correlation with the first factor percent of common variance of rho =
1.00. On the random data, G ranged from .284 to .348, and had a
Spearman rank correlation with the first factor percent of common variance
of rho = .60. It appears that when c ≠ 0, G approaches neither one
(for unidimensionality) nor zero (for nonunidimensionality).
Furthermore, G is no better as an indicator of unidimensionality than is
the first factor percent of common variance.

(7) The Part/Whole Test. If the items may be separated into
distinctive types or content, the a and b values may be estimated
separately for each type and for the entire test. If the parameter
estimates under the two conditions (part vs. whole) correlate highly,
unidimensionality is supported (Bejar, 1977(b), p. 13).

(8) The Vector Frequency Test. Assuming θ is normally
distributed, and given the item parameters, it is possible to calculate
the expected frequency of all possible response patterns. A comparison
with the observed frequency of all possible response patterns will
yield a non-significant chi-square, if unidimensionality is present
(Bock and Lieberman, 1970).
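
Several of the tests above reduce to short computations once the first factor loadings or the item parameters are in hand. The following is a minimal sketch of tests (3), (4), (5) and (8), not a reproduction of any published program. All function names and data values are hypothetical. The congruence function uses the standard definition of the coefficient (sum of cross-products over the root of the product of the sums of squares), under which identical loadings give 1.0. The vector frequency part assumes a three-parameter logistic IRF with the 1.7 scaling constant, local independence, and a crude evenly spaced grid in place of proper numerical integration over the normal distribution of θ.

```python
from itertools import product
from math import exp, sqrt


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def biserial_test(biserials, loadings, cutoff=0.80):
    """Test (3): do item-test biserials track first factor loadings?"""
    return pearson(biserials, loadings) >= cutoff


def loadings_share_sign(loadings):
    """Sign half of test (4); significance of loadings is not checked."""
    return all(v > 0 for v in loadings) or all(v < 0 for v in loadings)


def congruence(load_a, load_b):
    """Test (5): coefficient of congruence between two sets of loadings."""
    cross = sum(a * b for a, b in zip(load_a, load_b))
    return cross / sqrt(sum(a * a for a in load_a) * sum(b * b for b in load_b))


def irf(theta, a, b, c):
    """Three-parameter logistic IRF (assumed form, scaling constant 1.7)."""
    return c + (1.0 - c) / (1.0 + exp(-1.7 * a * (theta - b)))


def pattern_probabilities(items, n_points=81, low=-4.0, high=4.0):
    """Test (8): expected proportion of examinees showing each response
    pattern when theta is N(0,1). items is a list of (a, b, c) tuples."""
    step = (high - low) / (n_points - 1)
    grid = [low + k * step for k in range(n_points)]
    weights = [exp(-0.5 * t * t) for t in grid]
    total = sum(weights)
    weights = [w / total for w in weights]
    probabilities = {}
    for pattern in product((0, 1), repeat=len(items)):
        value = 0.0
        for theta, weight in zip(grid, weights):
            likelihood = 1.0
            for u, (a, b, c) in zip(pattern, items):
                p = irf(theta, a, b, c)
                likelihood *= p if u else 1.0 - p
            value += weight * likelihood
        probabilities[pattern] = value
    return probabilities


def chi_square(observed_counts, probabilities, n_examinees):
    """Pearson chi-square of observed against expected pattern counts."""
    total = 0.0
    for pattern, p in probabilities.items():
        expected = n_examinees * p
        total += (observed_counts.get(pattern, 0) - expected) ** 2 / expected
    return total


# Hypothetical loadings and biserials for a 5-item test.
biserials = [0.35, 0.42, 0.50, 0.58, 0.61]
group_a = [0.40, 0.45, 0.55, 0.60, 0.66]
group_b = [0.38, 0.47, 0.52, 0.63, 0.64]
print(biserial_test(biserials, group_a), loadings_share_sign(group_a))
print(round(congruence(group_a, group_b), 3))

# Hypothetical (a, b, c) parameters for a 3-item test: 8 possible patterns.
probs = pattern_probabilities([(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)])
print(len(probs), round(sum(probs.values()), 6))
```

The number of response patterns doubles with every item, so the vector frequency calculation is practical only for short tests.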

14.7 Unidimensionality is a sufficient condition for local
independence. That is, if you have unidimensionality, then you also always have
local independence. The reverse is not true. Local independence is
necessary for, but does not guarantee, unidimensionality.

CHAPTER 15
Computer Programs

15.1 There are several computer programs available for estimating
examinees' θs and item parameters. Only 2 of those are in general
use. Both are written in FORTRAN.

15.2 LOGIST was written at the Educational Testing Service and is
the program used by Lord for his work (See Wood et al, 1976). The
LOGIST and related programs provide a complete set of options for
calculating and printing:

a. examinee's θ,

b.  item  parameters, 

c.  item  response  curves, 

d.  test  characteristic  curve, 

e.  item  information  function, 

f.  score  information  curve,  and 

g.  relative  efficiency  of  2 tests. 

LOGIST allows either examinee's θ or item parameters as fixed
input, and puts all other estimates on the same scale as the input
parameters. It is by far the more versatile program. Lord recommends
that, to get good estimates, at least 1000 examinees and 30 items are
needed in the test.

However, LOGIST has one practical disadvantage. It requires from
30 minutes to two hours of computer CPU (Central Processing Unit) time.
Consequently, I was unable to convince my data processing people to
implement LOGIST and have not been able to use it.

LOGIST uses a maximum likelihood estimation procedure. It
computes all parameters at the same time, using an iterative
technique. The iterative technique computes the first estimates from the
raw data. Then, those estimates become input for the second iteration
of computation, using the same maximum likelihood procedure to compute
the second estimate. The second estimate becomes input for the third,
and so on. The iterations continue until the estimates converge, and
do not change significantly from one iteration to the next. Sometimes
the estimates do not converge, but drift off to infinity or fluctuate
wildly back and forth. In these cases, LOGIST applies certain
limiting rules.
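
The iterate-until-converged idea can be illustrated on a much smaller problem than the one LOGIST solves. The sketch below is not LOGIST's algorithm. It assumes the item parameters are already known, uses a two-parameter logistic IRF with the 1.7 scaling constant, and estimates a single examinee's θ by repeated Newton-Raphson steps. The function names, the tolerance, and the bound of 4 that stands in for a "limiting rule" are all assumptions made for the example.

```python
from math import exp


def irf(theta, a, b):
    """Two-parameter logistic IRF (assumed form, scaling constant 1.7)."""
    return 1.0 / (1.0 + exp(-1.7 * a * (theta - b)))


def estimate_theta(responses, items, tol=1e-4, max_iter=50, bound=4.0):
    """Maximum likelihood estimate of theta by repeated Newton-Raphson steps.

    responses : list of 1/0 item scores
    items     : list of (a, b) tuples, assumed known
    Returns (theta, converged). converged is False when the estimate
    ran into the bound or was still moving after max_iter iterations.
    """
    theta = 0.0
    for _ in range(max_iter):
        slope = 0.0        # first derivative of the log likelihood
        information = 0.0  # minus the second derivative
        for u, (a, b) in zip(responses, items):
            p = irf(theta, a, b)
            slope += 1.7 * a * (u - p)
            information += (1.7 * a) ** 2 * p * (1.0 - p)
        new_theta = theta + slope / information
        # Hypothetical limiting rule: keep the estimate inside [-bound, bound].
        new_theta = max(-bound, min(bound, new_theta))
        if abs(new_theta - theta) < tol:
            return new_theta, abs(new_theta) < bound
        theta = new_theta
    return theta, False


# Hypothetical 3-item test with a = 1.0 and b = -1, 0, 1.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(estimate_theta([1, 1, 0], items))  # mixed pattern: settles on a finite value
print(estimate_theta([1, 1, 1], items))  # all correct: drifts upward until bounded
```

The all-correct pattern shows the drift toward infinity described above: every step raises the estimate, and only the bound stops it.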

The  a and  b parameters  from  LOGIST  correlate  positively.  This  is 
an  unexpected  and  undesirable  result.  When  c parameters  do  not 
converge,  LOGIST  sets  all  non-converging  c parameters  equal  to  some 
average  value,  usually  between  .10  and  .25.  This  may  occur  with  50% 
to  80%  of  the  items  in  a single  test,  which  suggests  that  the  c 
parameter  is  not  well  estimated  by  LOGIST. 

15.3 OGIVIA* was written for Dr. Vern Urry of the U.S. Civil Service
Commission (USCSC) by Jerry Edwards, University of Washington and
revised by John Gugel of the USCSC. It has also been called URRY
and ESTEM in the literature. There are several versions of it, the
current one being called OGIVIA-3. OGIVIA-3 calculates and prints
both a classical item analysis and the item parameters. It has
options for the normal ogive and logistic models, but does not have
the scaling option of LOGIST. It does not print out examinees' θ's,
but could be made to do so without much trouble.

*Pronounced og"' ve-eye-aye

OGIVIA uses a Bayesian modal estimation procedure. It estimates item
parameters, using raw scores as an estimate of θ, by fitting the data
to a logistic (or normal) ogive. Chi Square is the test for goodness
of fit. It then re-estimates θ with the estimated item parameters
using an iterative technique until the θ estimates
converge or 20 iterations are done, whichever comes first. The re-estimates
of θ are then used to re-estimate the item parameters by the same curve
fitting technique. Estimates of θ, typically, do not converge on a small
percent (about 1%) of examinees; and item parameters sometimes do not
converge on as many as 5% to 10% of the items.

OGIVIA needs at least 1000 examinees and 60 items in the test,
with a test KR-20 of +.90 in order to get good estimates. I have
used it with as few as 150 examinees on a test of 30 items with
apparent success for my purposes. Such small numbers should
be used with caution. OGIVIA requires only two to five minutes of
CPU time on the computer. This fact makes OGIVIA much more attractive
than LOGIST, despite the possible difficulties of Bayesian estimations.

An interesting feature is an F-ratio for each item, which tells
how well the item's responses fit the model. An F-ratio of 4.00 or
5.00 or less means the data fit the model. In a comparison of the
F-ratios of 8 tests between the normal ogive model and the logistic
model, the logistic model fit the data slightly better than the normal
ogive model.

Urry has a new version of OGIVIA, called ANCILLES, which needs
only 30 items, but little is known about it because it is so new.

The a and b parameters of OGIVIA do not correlate highly. However,
the c parameter estimates fluctuate considerably from sample to
sample. This indicates that OGIVIA, too, does not estimate the c
parameter well.




15.4 Programs which print out IRFs, IIFs and TICs are available from
the USCSC. These were written by John Gugel.


15.5 All of these computer programs are available for the asking.

CHAPTER 16
Equating the θ-Scales

16.1 When raw data is fed into LOGIST or OGIVIA, the program
calculates the item parameters (called "calibrating" the item) and the
examinees' θs all at once. The a and b values are on the same scale
as θ, which is set to have a mean = 0, and standard deviation = 1. If
the same test is given to two groups of examinees, and the items
calibrated separately for each group, the a and b values from the
separate calibrations will not be comparable because the scales of θ
will be on different metrics. And the examinees' θs will not be
comparable from group to group. All that is necessary to correct this
situation is to lump both groups together and treat them as one group.
Then the θs for all examinees will be on the same scale.

16.2 Another (more laborious) method is to transform the θ-scale for
one group to the θ-scale of the other group. This transformation is
possible because the b-value of an item is invariant (except for
linear transformations of the θ-scale), and because the θ-scales are
linearly related.


The linear relationship is based on the traditional
standardization formula

$$ \frac{b_2 - \bar{b}_2}{SD_{b_2}} = \frac{b_1 - \bar{b}_1}{SD_{b_1}} \tag{16.2a} $$

where $b_1$ = the item b-value on the metric of Group 1,
$b_2$ = the item b-value on the metric of Group 2,
$\bar{b}_1$ & $SD_{b_1}$ = the mean and standard deviation of
the b-values of the items on metric of Group 1,
$\bar{b}_2$ & $SD_{b_2}$ = the mean and standard deviation of
the b-values of the items on metric of Group 2.

Lord (in preparation) recommends that, when calculating the
$\bar{b}$ and $SD_b$, items with low a values (a < .8) and extreme b-values
(|b| > 2.0) be excluded, because b-values for such items are not well
estimated.


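Solving equation 16.2a for $b_1$,

$$ b_1 = \left[\frac{SD_{b_1}}{SD_{b_2}}\right] b_2 + \left[\bar{b}_1 - \frac{SD_{b_1}}{SD_{b_2}}\,\bar{b}_2\right] \tag{16.2b} $$

Since b is on the θ-scale, θ may be substituted for b, and

$$ \theta_1 = \left[\frac{SD_{b_1}}{SD_{b_2}}\right] \theta_2 + \left[\bar{b}_1 - \frac{SD_{b_1}}{SD_{b_2}}\,\bar{b}_2\right] \tag{16.2c} $$

This equation may be used to transform an examinee's θ score from one
scale to another. It is NOT proper to use a regression equation based
on the correlation of the b-values or θ from the two groups of
examinees for this purpose.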

16.3 Suppose from a bank of 100 items we construct two tests
containing the following items:

Test    Item Bank #s

A       1-60

B       41-100

Each test has 60 items, 20 of which are common to both tests. Suppose
further, we give the tests to two groups of examinees (one test to
each group), and calibrate the tests separately.

We now want to put all 100 items on the same scale. We can do so
because we have 20 common items on the two tests and we have those
items calibrated on both scales. First, calculate the mean and standard
deviation of the b-values on each scale of the 20 common items. Then
all the b-values of one test may be converted to the metric of the other
test with Equation 16.2b. The θs may be converted with Equation 16.2c.

The a-values are converted by dividing by the ratio of the b-value
standard deviations.

$$ a_1 = a_2 \div \frac{SD_{b_1}}{SD_{b_2}} = a_2 \cdot \frac{SD_{b_2}}{SD_{b_1}} $$

NOTE: Remember, $a_2$ and $b_2$ mean the a and b values on the old scale
and $a_1$ and $b_1$ are the item's a and b values on the new scale.

The c-values are already on the same scale, because they are on the
P(θ) axis of the IRF. Thus, all c-values are always on the same scale
and need no conversion.
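
The conversions in this section amount to one straight line fitted through the common items. A minimal sketch follows, with hypothetical function names and made-up b-values; "old" is the scale being converted (Group 2) and "new" is the target scale (Group 1), following the NOTE above.

```python
from statistics import mean, stdev


def linking_constants(b_common_old, b_common_new):
    """Slope and intercept of the line carrying the old scale onto the
    new scale, from the b-values of the common items on both scales."""
    slope = stdev(b_common_new) / stdev(b_common_old)
    intercept = mean(b_common_new) - slope * mean(b_common_old)
    return slope, intercept


def rescale_b(b_old, slope, intercept):
    """Equation 16.2b: b-values move along the line."""
    return slope * b_old + intercept


def rescale_theta(theta_old, slope, intercept):
    """Equation 16.2c: thetas move along the same line."""
    return slope * theta_old + intercept


def rescale_a(a_old, slope):
    """a-values are divided by the ratio of the standard deviations."""
    return a_old / slope


# Hypothetical b-values of three common items on each scale.
b_old = [-1.0, 0.0, 1.0]   # Group 2 calibration
b_new = [-1.5, 0.5, 2.5]   # Group 1 calibration
slope, intercept = linking_constants(b_old, b_new)
print(slope, intercept)
print(rescale_b(0.8, slope, intercept), rescale_a(1.2, slope))
```

Only the ratio of the two standard deviations enters the slope, so it makes no difference here whether the n or the n-1 form of the standard deviation is used, as long as the same form is used for both scales.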

In order to build a large bank of calibrated items on the same
scale, it is desirable to include in the test items from the bank which
have already been calibrated along with new items. These items, which
are used to link one θ-scale to another θ-scale, are called "anchor
items." A minimum of 17 anchor items is recommended. More than 17
is desirable.

16.4 Occasionally the situation will arise where two different tests
(i.e., no common items) are given to two groups of examinees at
different times, and some of the examinees take both tests (called
"anchor persons"). This situation may be handled in either of two
ways.

(1) If there are enough persons, combine the answer sheets
of the anchor persons for both tests, and treat the two tests as one
long test. Then the θ-scales of the two separate tests may be
rescaled to the combined test θ-scale as described in Section 16.2
above.

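(2) Calibrate each test separately. Take the two θ values of
the anchor persons and calculate their means and standard deviations
for each test. Then use:

$$ \theta_1 = \left[\frac{SD_{\theta_1}}{SD_{\theta_2}}\right] \theta_2 + \left[\bar{\theta}_1 - \frac{SD_{\theta_1}}{SD_{\theta_2}}\,\bar{\theta}_2\right] $$

to rescale all examinees' θs,

$$ b_1 = \left[\frac{SD_{\theta_1}}{SD_{\theta_2}}\right] b_2 + \left[\bar{\theta}_1 - \frac{SD_{\theta_1}}{SD_{\theta_2}}\,\bar{\theta}_2\right] $$

to rescale the b-values, and to rescale a-values:

$$ a_1 = a_2 \cdot \frac{SD_{\theta_2}}{SD_{\theta_1}} $$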

Again, the c-values are already on the same scale.

The use of the θs to rescale assumes that θ has not changed between
administrations of the two tests.

16.5 Even if no anchor persons have taken both tests, it still may be
possible to take items from both tests, create a third test, administer
it to a third group of examinees, and then use the third test as anchor
items to link the original 2 tests.

16.6 Anchor items should be chosen in order to make the estimate
of $SD_b$ as accurate as possible. All estimates of b-values have some
error in them. To reduce the proportionate contribution of estimation
error to $SD_b$ we want the $SD_b$ as large as possible. That conclusion
suggests that the anchor item b-values should have a bimodal
distribution. That is, half of the anchor items should have high b-values,
and half should have low b-values.

However, very high and very low b-values are not estimated well,
which means they have a significant amount of error in them. We
should, therefore, compromise between a large $SD_b$ and large error
in estimates of b. This reasoning suggests that anchor items should
be bimodally distributed with half the items having moderately high
b-values, say 1.0 ≤ b ≤ 1.5, and half with moderately low b-values,
i.e., -1.5 ≤ b ≤ -1.0.

The  b-values  are  with  respect  to  the  groups  to  be  tested.  If 
there  is  good  reason  to  believe  that  the  group  to  be  tested  has  about 
the  same  distribution  of  ability  as  those  on  whom  the  anchor  items 
were  calibrated,  then  the  bimodally  distributed  b-values  are  the  best 
anchor  items.  However,  if  the  group  turns  out  not  to  have  about  the 
same  distribution  of  ability  as  the  calibration  group,  half  the  anchor 
items  may  be  either  too  easy  or  too  hard,  and  the  anchor  items  will 
not serve their purpose well.

A safer method would be to select anchor items to have a
rectangular distribution of b-values from -1.5 to 1.5. In this way
you will be confident of getting many anchor items of appropriate
difficulty and still have a large $SD_b$.
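
The size of the trade-off can be seen with a small calculation. The two sets of twelve anchor-item b-values below are hypothetical, constructed only to match the two shapes described in this section.

```python
from statistics import pstdev

# Hypothetical bimodal set: six items in [-1.5, -1.0], six in [1.0, 1.5].
bimodal = [-1.5, -1.4, -1.3, -1.2, -1.1, -1.0,
           1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

# Hypothetical rectangular set: twelve evenly spaced values from -1.5 to 1.5.
rectangular = [-1.5 + 3.0 * i / 11 for i in range(12)]


def sd_b(b_values):
    """Standard deviation of a set of anchor-item b-values."""
    return pstdev(b_values)


# Roughly 1.26 for the bimodal set and 0.94 for the rectangular set.
print(round(sd_b(bimodal), 2), round(sd_b(rectangular), 2))
```

Under these made-up values the rectangular set gives up about a quarter of the bimodal set's standard deviation in exchange for having items spread across the whole range of difficulty.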


The  a-values  of  anchor  items  should  be  as  large  as  possible, 
and  the  c-values  as  small  as  possible.  However,  the  a-value  is  more 


CHAPTER  17 
Tailored  Testing 


17.1  Psychometrists  long  have  known  and  deplored  the  fact  that  many 
items  on  a test  are  not  appropriate  for  a given  examinee,  i.e.  they 
are  either  too  hard  or  too  easy.  Until  IRT  there  was  no  satisfactory 
way  to  avoid  this  problem,  and  at  the  same  time  get  a decent  measure 
along  the  ability  scale. 

With IRT came the possibility of tailored testing, which is so
named because it allows the "tailoring" of the test to the ability of
the examinee. Tailored testing is also called adaptive testing.
Variations of it are called stradaptive testing and flexilevel
testing.

17.2  Tailored  tests  are  administered  by  a computer  with  the  items 
presented  on  a CRT  (Cathode  Ray  Tube  device,  which  is  similar  to  a 
television  set).  (See  Ree,  1977a.)  It  works  like  this: 

(1) The examinee sits in front of a CRT attached to a typewriter
keyboard.

(2) The examinee registers on the computer with his
identification, test name and other pertinent information.

(3) In the computer is stored a bank of 150 to 200, or more,
precalibrated items along with their item parameters. The computer
selects an item of average difficulty and presents the item to the
examinee on the CRT.

(4)  The  examinee  records  his  answer  on  the  typewriter  keyboard. 

(5) The computer uses the examinee's response and the item
parameters to estimate the examinee's most likely θ, and then selects
another item. The item selected is the one which will best help the
computer estimate θ after the examinee answers the item. If the
examinee got the item correct, he will get a different next item than
if he got the item wrong.

(6) Steps (4) and (5) above are repeated until the computer
meets  the  criterion  for  stopping  the  test.  The  criterion  for  stopping 
the  test  is  called  the  "stopping  rule." 
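
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used at any of the installations cited. It assumes the three-parameter logistic model with the 1.7 scaling constant, picks each item by maximum item information at the current θ estimate, scores by a Bayesian modal estimate on a grid with an N(0,1) prior, and stops after a fixed number of items. The function names, the θ grid, and the ten-item limit are assumptions made for illustration.

```python
import math

D = 1.7  # scaling constant that brings the logistic close to the normal ogive


def p_correct(theta, a, b, c):
    """Three-parameter logistic IRF: probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information at theta for the three-parameter logistic model."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


GRID = [g / 10.0 for g in range(-40, 41)]  # theta from -4.0 to 4.0


def posterior_mode(responses):
    """Bayesian modal theta on the grid, assuming an N(0,1) prior.

    responses is a list of ((a, b, c), u) pairs, u = 1 if correct, else 0.
    """
    def log_posterior(theta):
        total = -0.5 * theta ** 2  # log of the N(0,1) prior, constant dropped
        for (a, b, c), u in responses:
            p = p_correct(theta, a, b, c)
            total += math.log(p) if u == 1 else math.log(1.0 - p)
        return total
    return max(GRID, key=log_posterior)


def tailored_test(bank, answer, max_items=10):
    """Give items one at a time; answer(item) returns 1 (right) or 0 (wrong)."""
    remaining = list(bank)
    responses = []
    theta = 0.0  # start at average ability, so the first item suits an average examinee
    while remaining and len(responses) < max_items:
        # Step (5): pick the unused item that is most informative at the current theta.
        item = max(remaining, key=lambda it: item_information(theta, *it))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta = posterior_mode(responses)
    return theta, [item for item, _ in responses]
```

Calling `tailored_test(bank, answer)` with a bank of `(a, b, c)` tuples and a callback that records the examinee's response returns the final θ estimate and the items actually given. Two examinees who answer differently receive different items after the first one.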

17.3  Examinees  with  different  response  patterns  will,  in  general, 
get  a different  set  of  items;  yet  their  final  estimates  will  be  on 
the  same  metric.  Not  all  examinees  may  get  the  same  number  of  items, 
yet all θ estimates can be made to the same degree of accuracy.

17.4  Stopping  rules  can  be  designed  as  desired  to  fit  the  situation. 
Three  typical  stopping  rules  are: 

(1)  Stop  when  a specified  number  of  items  have  been  administered. 

(2) Stop when the SEE of the examinee's θ has dropped below
a specified value; often SEE ≤ .0625 is used.

(3)  Stop  when  no  more  items  remain  in  the  bank  that  will  provide 
a significant  amount  of  information  about  the  examinee. 

It is not uncommon to combine some of the above rules.
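
A combined stopping rule might look like the following sketch. The SEE target of .0625 comes from the text. The item limit of 30 and the minimum useful information of 0.05 are illustrative assumptions, as is the function name.

```python
def stopping_reason(n_given, see, best_remaining_info,
                    max_items=30, see_target=0.0625, min_info=0.05):
    """Return the reason for stopping the test, or None to keep testing.

    n_given             -- number of items administered so far
    see                 -- current standard error of estimate of theta
    best_remaining_info -- largest item information left in the bank
                           at the current theta estimate
    """
    if n_given >= max_items:          # rule (1): fixed number of items
        return "item limit reached"
    if see <= see_target:             # rule (2): SEE small enough
        return "SEE target reached"
    if best_remaining_info < min_info:  # rule (3): nothing useful left
        return "no informative items remain"
    return None
```

The rules are checked in order, so the test ends as soon as any one of them is satisfied.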

17.5 Because the computer selects items on the basis of item information,
the computer will usually select items with high a-values first,
and then after high a-value items have been exhausted, select other
items.
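
The preference for high a-values follows from the item information function, in which the a-value enters as a square. The sketch below assumes the three-parameter logistic model with the 1.7 scaling constant. The three items and their parameters are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant


def item_information(theta, a, b, c):
    """Item information at theta for the three-parameter logistic model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical bank: same b and c, different a-values.
bank = {
    "item 1": (0.6, 0.0, 0.2),
    "item 2": (1.8, 0.0, 0.2),
    "item 3": (1.2, 0.0, 0.2),
}


def rank_by_information(bank, theta):
    """Item names ordered from most to least informative at theta."""
    return sorted(bank, key=lambda name: item_information(theta, *bank[name]),
                  reverse=True)
```

At θ equal to the b-value, P(θ) does not depend on a, so doubling the a-value exactly quadruples the information. This is why `rank_by_information(bank, 0.0)` puts the items in order of their a-values.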

17.6 The reader will recall from Section 12.8 that the maximum likelihood
method (MLE) will estimate θ at plus or minus infinity until the
examinee  gets  one  item  wrong  and  one  item  correct.  Therefore,  if  the 
examinee  gets  the  first  item  correct,  the  computer  will  give  the 
hardest  item  in  the  bank  second.  And  it  will  continue  to  give  the 
hardest  items  until  the  examinee  gets  one  wrong. 


Similarly, if the examinee gets the first item wrong, the computer
will give the easiest items until the examinee gets one correct.
Since  Bayesian  estimation  methods  (see  Chapter  13)  do  not  have  this 
characteristic,  it  has  been  proposed  to  combine  the  two  estimation 
methods,  using  Bayesian  estimation  until  the  examinee  has  gotten  one 
item  wrong  and  one  item  correct;  and  then  switch  to  MLE. 
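
The difference between the two estimators can be seen numerically. The sketch below assumes the three-parameter logistic model and an N(0,1) prior, and searches a θ grid from -4 to 4. For an all-correct response pattern the likelihood keeps rising, so the grid MLE lands on the upper edge of the grid, while the Bayesian mode stays in the interior. The item parameters are hypothetical.

```python
import math

D = 1.7
GRID = [g / 10.0 for g in range(-40, 41)]  # theta from -4.0 to 4.0


def p_correct(theta, a, b, c):
    """Three-parameter logistic IRF."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, responses):
    """Log likelihood of a response pattern, given theta."""
    total = 0.0
    for (a, b, c), u in responses:
        p = p_correct(theta, a, b, c)
        total += math.log(p if u == 1 else 1.0 - p)
    return total


def grid_mle(responses):
    """Theta on the grid that maximizes the likelihood."""
    return max(GRID, key=lambda t: log_likelihood(t, responses))


def grid_bayes_mode(responses):
    """Theta on the grid that maximizes likelihood times an N(0,1) prior."""
    return max(GRID, key=lambda t: log_likelihood(t, responses) - 0.5 * t * t)


# Hypothetical examinee who has answered three items, all correctly.
all_correct = [((1.0, -1.0, 0.2), 1), ((1.0, 0.0, 0.2), 1), ((1.0, 1.0, 0.2), 1)]
print(grid_mle(all_correct), grid_bayes_mode(all_correct))
```

Widening the grid moves the MLE outward with it, which is the finite-grid picture of an estimate at infinity.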

Owen (1975) has developed a highly efficient algorithm for
Bayesian scoring. Owen's Bayesian scoring procedure is widely
used.

17.7  Tailored  testing  has  several  advantages  over  conventional  tests. 

(1) Depending upon the characteristics of the item bank, a
tailored test will use only 10% to 50% of the number of items required
by a conventional test and at the same time will measure more accurately
than the conventional test at almost all values of θ. Tailored
tests can measure to any specified degree of accuracy.

(2)  A tailored  test  takes  much  less  time  to  administer,  or 
several  abilities  can  be  measured  by  a tailored  test  in  the  same  time 
needed  to  measure  one  ability  by  a conventional  test. 

(3)  Security  of  the  items  is  much  improved,  because  different 
examinees  get  different  items,  and  because  the  items  are  much  less 
accessible  (in  the  computer  as  opposed  to  hard  copy). 

17.8  The  use  of  tailored  testing  also  has  some  problems. 

The  cost  of  large  scale  use  of  tailored  testing  machines  is 
currently  prohibitive  because  of  the  cost  of  CRT  devices,  an  on-line 
time-sharing  computer,  and  telephone  lines  to  hook  the  CRT  devices 
to  the  computer.  Moreover,  it  often  takes  5 seconds  for  the  computer 
to  do  its  calculations  and  present  the  next  item.  If  only  20  CRT 
devices  were  on  line  at  a time,  the  delay  to  get  the  next  item  could 
be  100  seconds  or  almost  two  minutes.  Such  delay  would  wipe  out 
the  advantage  of  reduced  administration  time  that  makes  tailored 
testing  attractive.  The  reliability  of  telephone  land  lines  is  also 
often  a problem. 
A feasible alternative would be a self-contained tailored testing
machine  with  the  items  presented  on  a Microfiche,  and  the  calculations 
done  by  a microprocessor. 

Major  Brian  Waters,  USAF,  has  developed  a prototype  of  such  a 
machine,  which  could  be  mass-produced  for  about  $500  each.  His  design 
requires  the  examinee  to  find  the  item  on  the  Microfiche  film.  The 
microprocessor  "senses"  the  location  of  the  film  and  will  not  accept 
a response  from  the  examinee  if  he  is  looking  at  the  wrong  item. 

Waters  estimates  that  the  microprocessor  could  be  made  to  control 
the  Microfiche  machine  (i.e.  present  the  item  automatically)  for  a 
mass-produced  cost  of  $1500  each. 

Another potential problem is that of legal defensibility.

Imagine  an  examinee  who,  after  talking  to  another  examinee,  finds  out 
that  his  items  were  different,  he  got  a different  number  of  items,  he 
got  more  items  correct,  and  yet  got  a lower  score  on  the  test.  This 
situation  contains  all  the  necessary  elements  for  a law  suit.  Now 
imagine  trying  to  explain  to  a judge  or  jury  that,  in  fact,  the 
examinee  was  not  improperly  discriminated  against,  and  that  the 
Bayesian  modal  or  maximum  likelihood  method  of  estimating  theta  was 
more  accurate.  Also  consider  the  hundreds  of  so-called  "testing 
experts"  across  the  country  who  have  never  heard  of  item  response 
theory  and  who  might  be  called  to  testify.  You  may  now  have  an 
inkling  of  the  enormous  problems  ahead  for  what  Lord  calls  "occult 
scoring  methods." 

17.9  Nevertheless,  work  is  progressing  toward  the  use  of  tailored 
testing.  The  U.S.  Civil  Service  Commission  has  adopted  the  use  of 
tailored testing as a matter of policy. The U.S. Air Force Human
Resources Laboratory, San Antonio, Texas, has a tailored testing
machine operating on an experimental basis at the San Antonio AFEES
(Armed Forces Entrance Examination Station) (Ree, 1977a). Several studies of
live tailored testing have been published by the Psychometric Methods
Program at the University of Minnesota. The Educational Testing
Service is also considering tailored testing and intends to
engineer its own tailored testing machine.

17.10  Obtaining  a large  bank  of  calibrated  items  is  not  a simple 
matter.  As  a result,  the  military  services  have  formed  an  Ad  Hoc 
Group on Adaptive Testing. One of its purposes is to share calibrated
items. It has become evident that even the sharing of the
items gets complicated. (See Ree, 1978.) Below is a list of information
necessary to share items.


(1)  item  itself  with  key 

(2)  reference  for  the  correct  answer 

(3)  a,  b,  and  c values 

(4) evidence of goodness-of-fit (i.e. if OGIVIA is used, the
Chi-Square and F-ratio)

(5)  evidence  of  unidimensionality 

(6)  the  name  of  the  dimension 

(7)  the  computer  program  used  to  estimate  parameters 

(8)  normal  ogive  or  logistic  model  (on  OGIVIA,  1st  or  2nd  cycle) 

(9)  description  of  the  sample  on  whom  calibrated,  and  size 

(10)  evidence  of  cultural-fairness 

(11)  description  of  anchor  items  used,  if  any. 
Once the bank has been established, a method for controlling its
use  must  be  established.  Most  users  will  want  the  items  with  high 
a-values  and  low  c-values.  If  the  users  are  giving  the  items  to  the 
same  population,  several  users  may  give  the  same  item  to  the  same 
examinees.  Such  over-exposure  can  destroy  the  worth  of  the  item. 


Items  in  the  bank  may  be  duplicates  or  near  duplicates.  Thus, 
careful  visual  inspection  of  the  items  in  the  bank  will  be  required. 

17.11  Such  problems  are  only  a few  of  those  which  will  be  encountered 
as  work  progresses.  A humorous,  actual  example  can  give  an  idea  of 
how  problems  cannot  be  anticipated. 

At  the  San  Antonio  AFEES,  examinees  were  being  tested  on  a 
tailored-testing  machine.  The  CRT  device  was  hooked  to  the  computer 
by  telephone  line.  The  telephone  used  for  the  connection  sat  on  the 
table  beside  the  CRT  device.  On  the  telephone  were  two  buttons 
labeled  "DATA"  and  "TALK." 


One  examinee,  when  left  alone,  pressed  the  "TALK"  button,  break- 
ing the  connection  with  the  computer,  and  called  his  mother  in  Dallas. 
(Ree,  1977b.) 

Murphy's  Law  reigns  supreme. 


CHAPTER 18


Item  Cultural  Bias 

18.1  The  study  of  culture-fair  testing  has  become  highly  complex, 
since  the  issue  came  to  public  attention  in  the  late  1960's. 

There  are  at  least  5 statistical  definitions  of  bias  and 
fairness,  and  3 ethical  positions.  This  paper  will  not  try 
to sort out those matters. For some of the more important papers,
see Cleary, 1969; Darlington, 1971; Hunter & Schmidt, 1976;
and Thorndike, 1971.

Moreover,  the  issue  of  the  practical  effect  of  test  bias  on 
predicting  some  outside  criterion,  such  as  job  performance  or 
college  GPA  is  also  not  of  concern  here,  since  we  have  guaranteed 
construct  validity  by  the  requirement  for  unidimensionality. 

In this sense the criterion is the examinee's true θ.


18.2 Studies of item-bias using classical test theory often
compare the item p-values for one group with the item p-values
of another group. Items with significantly different p-values
between the 2 groups are thought to be culturally-biased items.
Such  an  approach  is  inappropriate  for  at  least  2 reasons.  First, 
the  method  assumes  that  the  two  groups  have  the  same  average 
ability.  That  assumption  is  probably  false  even  if  the  groups 
are  matched  on  moderator  variables  such  as  educational  level, 
since  the  quality  of  education  varies  considerably  from  school 
to  school. 
Second, the comparison of p-values across groups assumes that
the  bivariate  distribution  of  the  p-values  is  linearly  related. 
Lord  (1977)  gives  several  proofs  that  the  p-values  can  NOT  be 
linearly  related.  One  proof  will  be  described  here. 

Consider  2 items.  Item  one  is  extremely  easy,  and  item  two 
is  extremely  difficult.  Assume  both  items  are  administered  to 
two  different  ethnic  groups,  A & B,  in  a test  with  other  items  of 
intermediate  difficulty.  The  p-values  for  both  groups  for  item 
1 will  be  1.00  because  the  item  is  so  easy.  The  p-values  for 
both groups for item 2 will be the c-value of the item, if the
item  is  so  hard  that  all  members  of  both  groups  have  to  guess  at 
the  answer.  We  can  plot  the  points  represented  by  the  p-values 
of  these  2 items.  (See  Fig.  18.2a). 


Figure  18.2a.  The  bivariate  distribution  of  p-values 
for  two  hypothetical  items,  # 1 and  # 2,  for  two 
hypothetical  groups,  A and  B.  Item  # 1 is  extremely 
easy,  and  item  # 2 is  extremely  difficult  for  both 
groups. 
In order to be linearly related the bivariate distribution of p-values
of  the  other  items  in  the  test  between  groups  must  fall  around  the 
straight  dashed  line  in  Fig.  18.2a,  connecting  the  points  for 
items  1 and  2.  However,  if  group  A does  better  as  a whole  on  the 
test  than  group  B,  the  p-value  points  for  many  items  of  intermediate 
difficulty  will  fall  to  one  side  of  the  straight  line.  (See  Fig. 
18.2b). 


Figure  18.2b.  The  bivariate  distribution  of  p-values 
of  items  in  a hypothetical  test  on  which  Group  A does 
better  than  Group  B. 
The line of best fit then must pass through the points for items
1 and 2, and through the middle of the bivariate distribution of the
p-values for other items in the test. That line must be curved to
do so, as is the solid line in Fig. 18.2b. Therefore, the relationship
between p-values cannot be linear. The same is true of other
classical item parameters, such as the "corrected" p-value, the
inverse normal transformation of the p-value, and "delta" (Lord
and Novick, 1968, p. 381).
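
The argument can be checked numerically. The sketch below is an illustration under stated assumptions, not Lord's proof: it assumes the three-parameter logistic model, takes θ to be normally distributed in each group with Group A half a standard deviation above the mean and Group B half a standard deviation below, and computes each item's p-value as the average P(θ) over the group. The three items are hypothetical.

```python
import math

D = 1.7


def p_correct(theta, a, b, c):
    """Three-parameter logistic IRF."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_value(item, mean, sd=1.0, steps=400):
    """Expected proportion correct in a group whose theta is N(mean, sd)."""
    a, b, c = item
    total = 0.0
    weight_sum = 0.0
    for k in range(steps + 1):
        theta = mean - 6.0 * sd + 12.0 * sd * k / steps
        weight = math.exp(-0.5 * ((theta - mean) / sd) ** 2)  # normal density, unnormalized
        total += weight * p_correct(theta, a, b, c)
        weight_sum += weight
    return total / weight_sum


items = {
    "very easy": (1.0, -6.0, 0.2),
    "intermediate": (1.0, 0.0, 0.2),
    "very hard": (1.0, 6.0, 0.2),
}
for name, item in items.items():
    # Group A mean = 0.5, Group B mean = -0.5
    print(name, round(p_value(item, 0.5), 3), round(p_value(item, -0.5), 3))
```

The very easy and very hard items have nearly equal p-values in both groups (near 1.00 and near the c-value), so their points lie on the line of equal p-values. The intermediate item does not, which is why the line of best fit must curve.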

For  this  reason  (and  others)  the  use  of  classical  test  theory 
item  parameters  is  inappropriate  for,  and  can  lead  to  erroneous 
identification  of  item  bias. 

18.3  The  a,b,  and  c item  parameters  are  invariant  across  groups 
under  the  assumption  of  unidimensionality.  Therefore,  it  should 
not  matter  in  principle  on  what  group  the  item  is  calibrated. 

Whatever  the  group,  the  a,  b,  and  c values  should  be  the  same 
except  for  some  random  estimate  error,  once  the  a and  b values 
are  on  the  same  metric. 

If the a and/or b and/or c values are significantly different,
when calibrated separately on two groups (and put on the same metric),
it means that examinees with identical θs will have different
chances of getting the item correct (P(θ)), depending on their
group. That situation is clearly unfair. Thus, we may define
bias between groups A & B as

P_A(θ) ≠ P_B(θ).

Of course, if P_A(θ) ≠ P_B(θ) for one θ-value, then they will
not be equal for other θ-values. However, it is not true that they
must be unequal for all θ-values. It is quite possible for an item
to be biased at, say, high θ and not biased at low θ. This
possibility stems from the fact that the P(θ)'s may be different
due to any one or more of the 3 item parameters being different.
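
A small numerical sketch shows how bias can be confined to one part of the θ scale. It assumes the three-parameter logistic model and two hypothetical calibrations of the same item that differ only in the c-value. In that case the two IRFs differ at low θ and nearly coincide at high θ.

```python
import math

D = 1.7


def p_correct(theta, a, b, c):
    """Three-parameter logistic IRF."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical separate-group calibrations of one item (a, b, c).
group_a = (1.0, 0.0, 0.10)
group_b = (1.0, 0.0, 0.30)


def bias_at(theta):
    """P_B(theta) - P_A(theta): positive values favor Group B."""
    return p_correct(theta, *group_b) - p_correct(theta, *group_a)


for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta, round(bias_at(theta), 3))
```

Differences in the a-value or the b-value produce other patterns, so the region of bias depends on which parameters differ.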

18.4 If P_A(θ) ≠ P_B(θ), then the item can be used to distinguish
between groups A & B, even in the unusual circumstance of all
examinees in both groups having identical θs. This distinction
means that the interitem correlation, given θ, is not zero:

r_ij|θ ≠ 0

But r_ij|θ = 0 is a requirement for local independence
(see Sec. 14.3), and local independence is a necessary (but not
sufficient) condition for unidimensionality (see Sec. 14.7).
Therefore, if P_A(θ) ≠ P_B(θ), the test is not unidimensional
with respect to groups A and B. The problem of item bias, then,
is one of violation of the assumptions of local independence and
unidimensionality with respect to the groups of examinees. I
hereby name this condition "group dimensionality".


18.5  If  many  of  the  items  are  group  dimensional,  that  condition 
may  be  detected  by  the  Congruence  Test  (see  Sec.  14.6(5)).  If 
only a few items are biased, the Congruence Test may not be sensitive
enough to detect them. In any case we still may wish to
know  just  how  a particular  item  is  biased,  and  what  relative 
effect  that  bias  has  on  the  groups  of  examinees. 

18.6 One method used to make this determination is the comparison
of the IRF's of the item for each of the groups. Figures 18.6a
to 18.6e show the IRFs of actual items from an experimental form
of the Coast Guard Selection Test (CGST-X1), when calibrated
separately for Blacks and Whites. From Fig. 18.6a we can see
that the P(θ) of item #32 is greater for Blacks than Whites at
every value of θ, although for θ greater than 0.0 and less than
-4.0, they are not much different. Thus, this item is biased
in favor of Blacks for θ between -4.0 and 0.0, and fair for
both groups elsewhere on the θ-scale. Since the range of θ
where this item is biased coincides very closely with the range
of θ where the item provides most of its information, the item
is not both fair and useful anywhere, and thus is a bad item.

Figure 18.6b shows the IRF's for a similar item, but which
is biased in favor of Whites instead. This item is biased for
θ = -0.6 to θ = -2.4 and for θ < -3.2. The item is not significantly
biased from θ = -2.4 to θ = -3.2 and for θ > -0.6.

Figure 18.6c shows the IRF's of an item that appears fair
or nearly fair at all θ in spite of a rather substantial difference
in the a-values for the two groups.

The item in Figure 18.6d is biased in favor of Blacks for θ
from 0.4 to 2.0, but biased in favor of Whites for all θ less
than 0.0.

The item in Figure 18.6e is Black-biased at θ < -1.0, and
White-biased for -0.8 < θ < 1.0.

18.7  In  my  interpretations  of  the  figures  in  Section  18.6  I have 
tended  to  disregard  small  differences  between  the  IRF's  which 
appear insignificant on the graphs. Actually, it is not yet
known how much of a difference between the IRFs of group dimensional
items makes a significant difference in the estimation
of θ. Such a determination would depend upon the distributions
of θ for the two groups, which as I indicated in Section 18.2 cannot
be expected to be the same.
18.8 In actual practice the item parameters are usually estimated
with both groups lumped together in the sample. When this combined-
group calibration is done, the resulting item parameters are some
complicated (not a simple or even weighted) average of the separate-
group calibrated parameters. However, as a rough rule of thumb
the IRF of the combined-group falls generally between the IRF's
of the separate groups. If the combined-group parameters are
then used to estimate θs for both groups, the result will be a
non-systematic distortion of the θs for both groups.

18.9  Attempts  have  been  made  to  develop  sample-free  indicators  of 
item  bias. 

One  way  is  to  plot  the  item  parameters  for  the  2 groups  as 
I have  done  for  each  of  the  parameters  for  Blacks  and  Whites  from 
CGST-X1  in  Figures  18.9a,  18.9b,  and  18.9c.  The  solid  lines  show 
the  theoretical,  expected  line  of  best  fit,  assuming  unidimensionality. 

The dashed lines were rather arbitrarily chosen to exemplify an
acceptance region. If the dashed lines were chosen statistically,
items whose item numbers lie outside the region would have statistically
significantly different item parameter(s). Another method
is used by Lord (1977). He divided his total sample into 2
random groups, and by conducting separate calibrations on each
random group, constructed his own empirical test of significance.

After identifying biased items, Lord then repeated the entire 5-step
procedure to eliminate the contamination of the biased items. His
procedure contemplates the possibility that the presence of strongly
biased items may mask the moderate bias of other items.

Rudner (1977) has compared 4 different methods of bias detection,
the best of which appears to be the calculation of the area
between the two IRFs.
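
The area between two IRFs can be approximated numerically. The sketch below is a simple trapezoidal approximation over a fixed θ range and is not necessarily the exact procedure Rudner used. It assumes the three-parameter logistic model, and the θ range of -4 to 4 and the step count are illustrative choices.

```python
import math

D = 1.7


def p_correct(theta, a, b, c):
    """Three-parameter logistic IRF."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def area_between_irfs(params_a, params_b, lo=-4.0, hi=4.0, steps=800):
    """Trapezoidal approximation of the unsigned area between two IRFs."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        theta = lo + k * width
        gap = abs(p_correct(theta, *params_a) - p_correct(theta, *params_b))
        weight = 0.5 if k in (0, steps) else 1.0  # end points count half
        total += weight * gap * width
    return total
```

For two calibrations with equal a-values and c-values, the area over the whole θ scale is (1 - c) times the difference in b-values, which gives a convenient check on the approximation.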

[Figures 18.9a, 18.9b, and 18.9c: item parameters from separate calibrations for Blacks and Whites.]


Both  methods  (Lord's  and  Rudner's)  assume  that  the  c-values 
for  both  groups  are  the  same,  an  assumption  which  both  Lord  and 
Rudner  would  agree  is  false. 

Lord  (in  preparation,  pp.  296-298)  provides  true  statistical 
difference  tests  for  maximum  likelihood  estimates  of  a-values 
and  b-values  (but  not  c-values). 

18.10 What makes an item culturally-biased (i.e., group dimensional)?

In  1968  I conducted  research  to  try  to  answer  this  question 
(using  classical  test  theory).  I made  two  lists  of  20  items 
each.  One  list  contained  only  Black-biased  items,  and  the  other 
only  White-biased  items.  Even  with  intense  study  of  the  two  lists 
neither  I nor  Black  testing  practitioners  to  whom  I showed  the 
lists  could  come  up  with  any  consistent  set  of  hypotheses  to 
explain the bias. My conclusion was that item-bias may not be
identifiable by inspection.

More  recent  investigations  offer  some  hope.  Lord  (1977) 
found  a reading  comprehension  item  on  Black  history  in  the  U.S. 
to be Black-biased at all values of θ. Durovic (1978) found that
two  minority  reviewers  had  strong  negative  reactions  to  the  two 
items  out  of  14  that  failed  the  Rasch  model  test  of  fit.  Scheuneman 
(1976)  found  that  her  chi-square  procedure  identified  items  as 
biased,  which  contained  content  readily  interpreted  as  culturally- 
biased  from  a common  sense  point  of  view. 

These results suggest that at least some biased items may
be  identifiable  by  inspecting  their  content.  Nevertheless,  other 
item-bias  seems  inexplicable  by  content,  such  as  the  following 
two  vocabulary  items  from  the  Scholastic  Aptitude  Test.  Both  items 
seek  the  OPPOSITE  of  the  stem  word. 

2. INJURE (quoted from Lord, 1977, p. 29)

A. release
B. refrain
C. smooth
D. embellish
*E. heal

8. GEL (quoted from Lord, in preparation, p. 294)

A. glaze
B. debase
C. corrode
*D. melt
E. infect

Both items are Black-biased (i.e., in favor of Blacks) at low
θ and White-biased at high θ.


18.11 Testing practitioners must often make practical decisions
even in the total absence of relevant information. Because I am
such a testing practitioner, and because of my penchant for rules
of thumb, I offer the following guidelines for rejecting items
as biased between groups A and B. Let

d(a),  d(b),  d(c)  = the  absolute  values  of  the  difference  of  the  a, 
b,  and  c values,  respectively,  for  the  two  groups  (after  being 
converted  to  the  same  scale). 

Then,  I declare  as  biased  any  item  which  meets  any  one  or  more  of 
the  four  following  conditions: 

(1)  d(a)>  .80 

(2)  d(b)>  .50 

(3)  d(c)>.15 

(4)  d(a)  + d(b)>  1.00 

The  dotted  lines  in  Figures  18.9a,  18.9b,  and  18.9c  reflect  the 
first  three  of  these  criteria.  There  are  so  many  legitimate 
objections  to  these  rules  of  thumb  that  I shall  not  try  to  justify 
them. I developed them merely by looking at my data and trying
to  come  up  with  something  usable  and  plausible.  Perhaps  the 
outrageousness  of  my  suggestion  will  motivate  the  research  necessary 
to  develop  truly  scientific  criteria.  In  the  meantime  practitioners 
must  practice. 
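
The four rules of thumb translate directly into a short function. The sketch below assumes the two sets of parameters have already been put on the same scale. The function name and the tuple layout (a, b, c) are illustrative assumptions.

```python
def bias_flags(params_a, params_b):
    """Return the numbers of the rules of thumb that an item violates.

    params_a, params_b -- (a, b, c) for groups A and B, on the same scale.
    An empty list means the item is not declared biased.
    """
    d_a = abs(params_a[0] - params_b[0])
    d_b = abs(params_a[1] - params_b[1])
    d_c = abs(params_a[2] - params_b[2])
    flags = []
    if d_a > 0.80:
        flags.append(1)
    if d_b > 0.50:
        flags.append(2)
    if d_c > 0.15:
        flags.append(3)
    if d_a + d_b > 1.00:
        flags.append(4)
    return flags
```

An item is declared biased whenever the returned list is not empty. Rule (4) catches items whose a and b differences are each tolerable alone but large in combination.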
18.12 As a slight digression I feel compelled to mention a study
reported by Weiss (1975). The study was only a small part of the
cited reference (pp. 33-35), and I shall discuss only a portion of
the results. Furthermore, Weiss is very cautious in his interpretation.
Nevertheless, the potential implications of the results, if
replicable, are of such tremendous import to the field of culture-
fair testing that I feel all testing practitioners should be aware
of them.

Weiss  investigated  (among  other  things)  the  effect  of  immediate 
feedback  on  test  score.  He  administered  a conventional  multiple- 
choice  test  to  Black  and  White  high  school  students  with  the  items 
presented by computer on a CRT. (This was not tailored testing.
All examinees received the same items.)

Half  of  each  group  (Black  and  White)  received  immediate  feedback 
from  the  computer  after  each  response,  indicating  whether  or  not 
the  examinee  got  the  item  correct.  The  other  half  of  each  group 
received  no  feedback. 

Feedback  was  in  the  form  of  one  of  six  statements  used  in  a 
pseudorandom  order,  such  as  "right  on",  "that's  cool,  now  try  this 
one",  and  "all  right,  how  about  this  one".  The  six  statements  were 
selected  from  those  suggested  by  other  students  at  the  same  high 
school  in  order  to  make  the  feedback  meaningful  to  the  examinees. 

With  no  feedback  Blacks  scored  much  worse  than  Whites,  an 
unfortunate  result  that  has  been  frequently  observed. 
However, under the feedback condition Blacks did as well as
(actually  slightly  better  than)  Whites.  Further  analysis  of  the 
data  showed  that  without  feedback  Blacks  skipped  (left  unanswered) 
more  items  than  Whites.  But  with  feedback  the  Blacks  skipped  almost 
no  items. 

These results suggest that differences in observed test scores
between groups may be due to motivational variables, such as a
need for encouragement on the part of Blacks, and that, when such
encouragement is received, Blacks score as well as Whites.

If  these  results  prove  to  be  replicable,  the  use  of  testing 
machines  with  appropriate  feedback  could  resolve  a large  part  of 
the culture-fair testing controversy.


CHAPTER 19

Setting  Minimum  Passing  (Cut-Off)  Scores 


19.1  One  of  the  more  common  uses  of  testing  is  the  classification  of 
examinees into two or more categories. For instance, a college
entrance examination may be used to classify into acceptable vs.
nonacceptable categories, or remedial program vs. regular program
vs. advanced placement categories. A job knowledge test may be used
to classify applicants into hire vs. don't hire, or promote vs. don't
promote categories. Each of these examples is one of "classification."
The examinees are being "classified" into discrete categories.

The classical methodology is to conduct a validation study in
which large numbers of persons are tested and measured on the criterion.
Then, making the dubious assumption of a linear relationship
between  test  score  and  criterion  measure,  the  criterion  measure  is 
predicted  from  the  test  scores.  The  predictive  validity  study,  the 
ideal,  is  almost  always  extremely  expensive  and  usually  impossible  in 
practice.  Its  less  satisfactory  alternative,  the  concurrent  validity 
study,  is  also  usually  expensive,  and  often  fraught  with  problems. 

19.2 There are two exciting, inexpensive approaches to this important
and most troublesome psychometric problem, which I shall
briefly  describe  with  variations,  combining  them  with  IRT.  Both 
techniques  are  simple  to  use  and  rather  ingenious. 

19.3 Livingston (1976) described a method of finding a criterion-
referenced cut-score, which requires only a few criterion measurements.

The Livingston method follows:

(1)  Give  the  test  to  a group  of  examinees. 

(2) Pick an examinee with an average test score and measure
his performance on the criterion.

(3) If he is competent (satisfactory) on the criterion, pick
another  examinee  with  a lower  test  score.  If  he  is  incompetent 
(unsatisfactory)  on  the  criterion,  pick  an  examinee  with  a 
higher  score. 

(4)  Repeat  step  (3)  over  and  over,  each  time  reducing  the 
difference  between  the  last  test  score  and  the  next  one. 

(Livingston  gives  several  methods  of  minimizing  the  number 
of  criterion  measurements  required).  With  each  repetition 
of  step  (3),  the  range  of  uncertainty  of  the  test  score  that 
corresponds  to  the  level  of  minimum  competence  will  be 
diminished.  When  you  have  "zeroed-in"  on  the  minimum  test 
score  with  sufficient  accuracy,  you  can  stop. 
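
The zeroing-in procedure can be sketched as interval halving. Livingston gives several methods of his own for choosing the next score; the plain halving below is a swapped-in illustration, not his procedure. It assumes a sharp cut (every score at or above the cut is competent), starts from the middle of the observed scores rather than the average, and uses a hypothetical callback `is_competent` that stands for measuring one examinee with the given score on the criterion.

```python
def find_cut_score(scores, is_competent):
    """Search for the lowest test score whose examinee is competent.

    scores       -- sorted list of the distinct test scores observed
    is_competent -- callback: measure an examinee having this score on
                    the criterion and return True if satisfactory
    Returns (cut score or None, number of criterion measurements made).
    """
    low, high = 0, len(scores) - 1
    cut = None
    measured = 0
    while low <= high:
        middle = (low + high) // 2
        measured += 1
        if is_competent(scores[middle]):
            cut = scores[middle]   # competent: try a lower score next
            high = middle - 1
        else:
            low = middle + 1       # incompetent: try a higher score next
    return cut, measured
```

With 41 distinct scores, no more than six criterion measurements are needed, which is the economy the method promises when the cut is sharp.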

Livingston's technique has two significant limitations. First,
it uses the number-right score as the predictor. As we have seen
in Sec. 12.9, the number-right score can correspond to a wide range
of θ, unless the test happens to have high information at the
cut-off θ. Since the cut-off θ is not known at the beginning of
the technique, finding high information at the ultimately-determined
cut-off θ would be pure luck. Selecting examinees on the basis of
their θ estimates would improve this method. (This would be an ideal
application of tailored testing.) Once the cut-off θ is found, one
can redesign the test to have high information at that θ, and then use
the corresponding number-right score for selection.

Second, the technique seems to assume that the criterion-measure is
unidimensional, and the same dimension as the test. Obtaining a
unidimensional criterion measure will be difficult in many practical


Both use of the number-right score and nonunidimensionality of
the criterion (and/or the test) will result in failure to find a
sharp cut-score. Rather, the percent of examinees found satisfactory
will rise gradually as test score increases, looking much like an IRF.
The decision must then be made of an acceptable risk level of probability
of success on the criterion. Usually a 50% risk is used.

In some situations it may be relatively easy to identify initially
a group of persons of marginal competency (e.g. "Supervisor,
give me a list of your barely acceptable subordinates.") If it is
feasible to do so, one may administer the test to them, and find their
average θ (or average number-right score, if need be). Their average
θ, or score, would be near the best cut-off.

The  Livingston  technique  of  selecting  examinees  with  higher  or 
lower θ is analogous to the item selection technique in tailored
testing  of  selecting  harder  or  easier  items.  Hence,  I dub  this 
technique  "tailored  cutting." 


19.4  The  other  cut-score  setting  technique,  called  MAPL,*  (Minimum 
Acceptable  Performance  Level)  was  introduced  by  Nedelsky  (1954). 

One  version  of  the  MAPL  procedure  follows: 

(1)  Assemble  a group  of  six  to  eight  subject-matter  experts 
(SMEs). 

(2)  Instruct  the  SMEs  to  form  a picture  in  their  minds  of  the 
barely  acceptable  person  for  the  job  (or  other  criterion). 


*pronounced  "maple" 

(3) Each SME then reads each test item and item distractor and
asks  himself  the  question,  "Would  the  barely  acceptable  person  know 
that  this  distractor  (wrong  alternative)  is  wrong?" 

(4)  If  his  answer  to  the  question  is 

(a)  definitely  no,  he  assigns  two  points  to  the  distractor. 

(b)  definitely  yes,  he  assigns  0 points  to  the  distractor. 

(c)  neither  definitely  yes  nor  no,  he  assigns  one  point 
to  the  distractor. 

(5)  Two  points  are  always  assigned  to  the  correct  choice  (key). 

(6)  Add  up  the  points  assigned  to  all  the  choices  of  the  item 
(including  the  key)  by  each  of  the  SMEs. 

(7)  Average  the  total  points  assigned  to  the  item  by  the  SMEs. 

(8) Divide two by the average total points. The quotient
is called the ASI (Alternative Similarity Index).

ASI = 2 ÷ (average total points)


(9)  Add  up  the  ASIs  for  all  the  items  in  the  test.  This  sum  is 
the  MAPL  for  the  test.  The  MAPL  is  the  number-right  score  of 
the  minimally  acceptable  person. 

MAPL = Σ ASI

MAPL  is  amazingly  simple  and  highly  effective  in  identifying 
unsatisfactory  individuals  and/or  training  need  areas.  (See  Meredith, 
1977). 
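
Steps (4) through (9) of the procedure can be sketched in a few lines of code. This is only a minimal illustration: the function names, the layout of the ratings, and the sample ratings (three four-choice items rated by two SMEs, fewer than the six to eight recommended) are hypothetical.

```python
from statistics import mean

KEY_POINTS = 2  # step (5): the correct choice always receives two points


def asi(ratings_by_sme):
    """Alternative Similarity Index for one item.

    ratings_by_sme -- one list per SME, holding the points (0, 1, or 2)
                      that SME assigned to each distractor of the item.
    """
    # Step (6): total points per SME, including the key.
    totals = [KEY_POINTS + sum(points) for points in ratings_by_sme]
    # Steps (7) and (8): average the totals, then divide two by the average.
    return 2 / mean(totals)


def mapl(test_ratings):
    """Step (9): the MAPL is the sum of the ASIs over all items."""
    return sum(asi(item) for item in test_ratings)


# Hypothetical ratings: items -> SMEs -> points for the three distractors.
ratings = [
    [[2, 2, 2], [2, 2, 2]],  # no distractor recognized as wrong: ASI = .25
    [[0, 0, 0], [0, 0, 0]],  # every distractor recognized as wrong: ASI = 1.00
    [[1, 1, 0], [2, 0, 0]],  # mixed judgments: ASI = .50
]
print(mapl(ratings))
```

For these made-up ratings the three ASIs are .25, 1.00, and .50, so the MAPL is 1.75 items correct out of three.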

19.5  MAPL  may  be  made  even  simpler  and  perhaps  more  effective  by 
combining  it  with  IRT. 


The ASI is, in fact, an estimate of the P(θ) of a person with
a barely acceptable θ. This identity may be seen by considering
the two extreme cases. If the SMEs assign two points to every
distractor in a four-choice item, then the ASI = 2 ÷ (2+2+2+2) =
2 ÷ 8 = 1/4 = .25. The SMEs have, in effect, judged that the barely
acceptable  person  would  not  know  that  any  of  the  distractors  are 
wrong,  and  hence  all  choices  are  equally  attractive.  The  barely 
acceptable  person  then  would  have  to  guess  and,  assuming  he  guesses 
randomly,  would  have  a .25  chance  of  guessing  correctly. 

On  the  other  hand,  if  the  SMEs  assigned  zero  points  to  every 
distractor,  that  means  the  SMEs  judged  that  the  barely  acceptable 
person  will  know  all  the  distractors  are  wrong.  He  will  thus  be 
sure to get the item correct and the ASI = 2 ÷ (2+0+0+0) = 2 ÷ 2 =
1.00.


If the test is unidimensional and if the items have been
precalibrated, then, using the ASI as an estimate of P(θ), it is an
easy matter to get the θ of the barely acceptable person for that
item with the following formula:


\[ \text{MAPL } \theta = b + \frac{1}{1.7a}\,\log\!\left(\frac{\text{ASI} - c}{1 - \text{ASI}}\right) \]


where log means the natural logarithm. This formula is merely the
logistic formula for P(θ), solved for θ, with ASI substituted for
P(θ). MAPL θ is the estimated θ of the minimally acceptable person.
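
For readers who wish to see the algebra, the steps of solving for θ may be written out as follows. This is a sketch that assumes the three-parameter logistic form of P(θ) with the 1.7 scaling constant, and writes x = 1.7a(θ − b).

```latex
\begin{align*}
P(\theta) &= c + \frac{(1 - c)\,e^{x}}{1 + e^{x}}, \qquad x = 1.7a(\theta - b) \\
\bigl(P(\theta) - c\bigr)\bigl(1 + e^{x}\bigr) &= (1 - c)\,e^{x} \\
P(\theta) - c &= e^{x}\bigl(1 - P(\theta)\bigr) \\
e^{x} &= \frac{P(\theta) - c}{1 - P(\theta)} \\
1.7a(\theta - b) &= \log\frac{P(\theta) - c}{1 - P(\theta)} \\
\theta &= b + \frac{1}{1.7a}\,\log\frac{P(\theta) - c}{1 - P(\theta)}
\end{align*}
```

Substituting ASI for P(θ) in the last line gives the estimate for a single item. The logarithm is defined only when the ASI lies strictly between c and 1.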


In  the  usual  application  of  MAPL  the  SMEs  must  assign  points 
to  every  distractor  of  every  item  in  the  test.  However,  in  this 
suggested  alternative,  the  SMEs  would  only  have  to  do  a few  items, 
perhaps 10 to 15. The MAPL θ would then be the average estimated
θ of the barely acceptable person for the 10 to 15 items.
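
The averaging just described might be sketched as below. The function names and the two sample items are hypothetical, and the per-item estimate assumes the three-parameter logistic model with the 1.7 scaling constant, solved for θ with ASI in place of P(θ).

```python
import math

D = 1.7  # scaling constant of the logistic model


def item_theta(asi, a, b, c):
    """Estimated theta of the barely acceptable person from one item.

    Assumes the three-parameter logistic model solved for theta,
    with the ASI substituted for P(theta).
    """
    if not c < asi < 1.0:
        # Outside this range the logarithm is undefined.
        raise ValueError("ASI must lie strictly between c and 1")
    return b + math.log((asi - c) / (1.0 - asi)) / (D * a)


def mapl_theta(items):
    """Average the per-item estimates; items holds (asi, a, b, c) tuples."""
    thetas = [item_theta(asi, a, b, c) for asi, a, b, c in items]
    return sum(thetas) / len(thetas)


# Hypothetical precalibrated items, each with a = 1.00 and c = .25.
sample_items = [
    (0.625, 1.0, -0.5, 0.25),
    (0.625, 1.0, 0.5, 0.25),
]
print(mapl_theta(sample_items))
```

Note that an ASI of exactly c or exactly 1.00 yields no finite θ, so an item on which the SMEs are unanimous at either extreme cannot contribute to the average in this sketch.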

This melding of MAPL and IRT also has some limitations. MAPL θ
presumes, as does MAPL, that the SMEs are able to properly make the
decisions required of them.

Since the ASI cannot be less than .25 (for a four-choice item),
the technique assumes that c = 1/A, which we have shown is often not
the case. Therefore, it would be well to choose items which have
c ≈ .25.

This  technique  also  has  a compounding  of  error.  That  is,  the 
ASI is an estimate of P(θ), and the a, b, and c values are estimates
of  the  true  item  parameters.  When  the  two  estimates  are  combined, 
their  separate  errors  may  be  multiplied.  To  reduce  this  compounding 
of  error,  items  should  be  chosen  which  have  moderately  low  a-values, 
i.e.  about  a = 1.00.  Furthermore,  the  items  used  should  have  a range 
of  b-values  from,  say,  -1.00  to  +1.00. 

MAPL θ is untried, and should be used with caution until adequate
research  on  it  can  be  done.  It  is,  therefore,  more  of  a suggestion 
than  a true  alternative. 


POSTWORD 


The purpose of any communication is the creation of understanding.
That is my sole purpose: to create understanding of IRT in the
reader.

If there is any part of this publication that you do not
understand, then I have not been completely successful in my effort.

Therefore, I would sincerely appreciate any comments, suggestions,
questions, corrections, ideas, or discussion about this publication.
Please feel free to telephone or write to me for further
explanation, discussion, criticism, or just to chew the fat about
IRT.


THOMAS  A.  WARM,  Chief,  Exam  Branch 
Research  and  Examination  Division 
U.S.  Coast  Guard  Institute 
P.O. Substation 18
Oklahoma  City,  OK  73169 

(405)686-2417  --  commercial 
732-2417 -- FTS
APPENDIX A

LOGISTIC  IDENTITIES  & EQUATIONS 


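θ = ability parameter
a, b, c = item parameters
e = natural logarithm base
1.7² = 2.89

Let x = 1.7a(θ − b)

\[ e^{-x} = e^{-1.7a(\theta - b)} = \frac{e^{1.7ab}}{e^{1.7a\theta}} \]

\[ \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}} \]

\[ \frac{1}{1 + e^{x}} = \frac{e^{-x}}{1 + e^{-x}} \]

\[ \frac{e^{x}}{(1 + e^{x})^{2}} = \frac{e^{-x}}{(1 + e^{-x})^{2}} \]

\[ P(\theta) = c + \frac{(1 - c)\,e^{x}}{1 + e^{x}} = \frac{c + e^{x}}{1 + e^{x}} = \frac{c\,e^{1.7ab} + e^{1.7a\theta}}{e^{1.7ab} + e^{1.7a\theta}} \]

\[ Q(\theta) = 1 - P(\theta) = \frac{1 - c}{1 + e^{x}} \]

\[ P(\theta) \cdot Q(\theta) = \frac{(1 - c)(c + e^{x})}{(1 + e^{x})^{2}} \]

\[ P'(\theta) = -Q'(\theta) = \frac{1.7a(1 - c)\,e^{x}}{(1 + e^{x})^{2}} = \frac{1.7a(1 - c)\,e^{-x}}{(1 + e^{-x})^{2}} \]

\[ I(\theta, u) = \frac{P'^{2}}{P \cdot Q} = \frac{2.89\,a^{2}(1 - c)}{(c + e^{x})(1 + e^{-x})^{2}} \]

\[ P''(\theta) = \frac{(1.7a)^{2}(1 - c)(1 - e^{x})\,e^{x}}{(1 + e^{x})^{3}} \]

If c = 0, I(θ, u) = 1.7a P'(θ).

\[ w(\theta) = \frac{P'(\theta)}{P(\theta)\,Q(\theta)} = \frac{1.7a\,e^{x}}{c + e^{x}} \]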
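
Identities of this kind are easy to check numerically. The sketch below assumes the three-parameter logistic form P(θ) = (c + e^x)/(1 + e^x) with x = 1.7a(θ − b); the function names and the sample parameter values are hypothetical.

```python
import math

D = 1.7  # scaling constant; D squared is 2.89


def x_of(theta, a, b):
    return D * a * (theta - b)


def p(theta, a, b, c):
    """Probability of a correct response, P(theta)."""
    ex = math.exp(x_of(theta, a, b))
    return (c + ex) / (1 + ex)


def q(theta, a, b, c):
    """Probability of a wrong response, Q(theta) = 1 - P(theta)."""
    return 1 - p(theta, a, b, c)


def p_prime(theta, a, b, c):
    """Slope of the item response function, P'(theta)."""
    ex = math.exp(x_of(theta, a, b))
    return D * a * (1 - c) * ex / (1 + ex) ** 2


def item_information(theta, a, b, c):
    """Item information computed from its definition, P'^2 / (P * Q)."""
    return p_prime(theta, a, b, c) ** 2 / (p(theta, a, b, c) * q(theta, a, b, c))


def item_information_closed(theta, a, b, c):
    """Item information from the closed-form expression."""
    x = x_of(theta, a, b)
    return D ** 2 * a ** 2 * (1 - c) / ((c + math.exp(x)) * (1 + math.exp(-x)) ** 2)


# Hypothetical item and ability values for a spot check.
theta, a, b, c = 0.4, 1.2, -0.3, 0.2
print(item_information(theta, a, b, c), item_information_closed(theta, a, b, c))
```

The two printed values should agree to within rounding error. Setting c = 0 in the same functions gives a quick check that the item information equals 1.7a times P'(θ).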